Message from the Program Co-Chairs


Welcome to OSDI ’10, the biggest OSDI yet, with 32 papers selected from an all-time high of 199 submissions. In 
approaching the task of chairing OSDI, we started with the explicit intention of accepting a larger set of papers, 
consistent with the growth in the field. Below we outline some of the rationale behind this goal, and the process we 
applied to achieve it. 


Computer systems research is growing as a community. We believe that progress on computer systems research is 
limited by manpower, not by the limits of a finite domain for interesting research. By implication, as the number 
of systems researchers increases, the volume of interesting research likely goes up as well. Year after year, top 
research programs add faculty or research positions in the systems area, while at the same time new programs es- 
tablish their presence in the field, including newfound growth outside the traditionally strong geographies. The ex- 
pansion of our community is consistent with the robust scientific and commercial application of computer systems 
research, providing a strong economic basis for this growth. We believe a larger OSDI program is an appropriate 
reflection of this growth in the systems community. 


We were also motivated by the challenge of making meaningful distinctions, under the pressure of program committee
deadlines, between papers that are almost accepted and those that are almost rejected. The fragility of the PC decision
process has been documented and discussed elsewhere [A08]. Too often, rejections seem arbitrary in retrospect,
hinging on the nuances of a PC discussion rather than clear merit. In accepting more papers we hope to incrementally
reduce the fragility of these decisions, while also building a program that is more diverse and therefore of
broader interest.


This goal of a larger program was a consideration throughout the review process. The PC was split into two groups: 
a “heavy” PC who participated in the first two rounds of reviewing, and a “heavier” PC who also reviewed papers 
in round three and attended a face-to-face meeting to decide final outcomes. In the first round, each paper received 
two reviews and approximately 35 papers were pruned. To reduce the risk of a premature pruning decision, we 
allowed reviewers to “rescue” a pruned paper by simply stating their support, with no discussion required. Each 
round-2 paper received three additional reviews. Another 80 or so papers were pruned after this round. This left us 
with a pool of 85 papers, each of which received two or three additional reviews in preparation for the PC meeting. 
After the second and third review rounds, borderline papers were discussed electronically by the reviewers and 
rejected by consensus of the reviewers. 


In the single-day, face-to-face PC meeting each remaining paper was presented by a reviewer, generally an advo- 
cate, followed by a time-limited discussion. Based on the first discussion, we binned each paper into one of four 
categories: “accept,” “acceptable,” “questionable,” and “reject.” No rejects were allowed in the first part of the day, 
the goal of this rule being to avoid the problem of a negative start leading to rejecting good papers early. When all 
papers had been discussed once, we briefly considered and then accepted the “acceptable” papers as a group, then 
began the difficult work of reconsidering the “questionable” papers. At the end of the meeting about 30 papers had 
been accepted. 




In the days following the PC meeting, a small set of additional papers were accepted based on an email vote by the 
heavier PC members. While unusual, we justified this process based on our goal to create a larger and more inter- 
esting program, and a sentiment shared by many PC members that the PC discussion had not given due consider- 
ation to several of the best liked but most controversial papers. In retrospect we believe these late accepts allowed 
us to create a stronger and more interesting program, and we would encourage future PC chairs to plan an appro- 
priate process for thoughtful consideration of difficult papers after the bustle of the PC meeting has subsided. For 
example, even with a single-day PC meeting, it might make sense to put a small set of papers into an “overnight” 
category, allowing a broader collection of PC members to study them before a final decision the next day. 


Apart from the review process, we took some additional measures to try and get more reviews and reviewers in a 
mindset to accept. We encouraged positivity, following Hill and McKinley’s excellent advice [HM05]. We strictly 
applied conflict-of-interest rules, such that conflicted PC members were not given access to results for conflicted
papers until notifications had been sent to authors. We tried to lighten the PC load from papers that had no chance
of acceptance, to leave more quality time for the remaining papers.


USENIX Association 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’10)


Before we close we’d like to briefly acknowledge a few individuals who made a difference in our bringing this 
program to you. The USENIX staff was fantastic throughout the entire process. We also thank Eddie Kohler for his 
continued support of HotCRP, a truly wonderful piece of software. We also would like to acknowledge the program 
committee for their tireless efforts and thoughtful reviews, and Haryadi Gunawi for his detailed note-taking during 
the PC meeting. Finally, we would like to thank our families and the families of PC members for supporting (and 
tolerating!) the long hours required to do this kind of work. 


Thank you for attending OSDI ’10, and have a great conference! 


Remzi Arpaci-Dusseau, University of Wisconsin, Madison 
Brad Chen, Google 
OSDI ’10 Program Co-Chairs 
An Analysis of Linux Scalability to Many Cores
M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich
MIT CSAIL 


ABSTRACT 


This paper analyzes the scalability of seven system appli- 
cations (Exim, memcached, Apache, PostgreSQL, gmake, 
Psearchy, and MapReduce) running on Linux on a 48- 
core computer. Except for gmake, all applications trigger 
scalability bottlenecks inside a recent Linux kernel. Us- 
ing mostly standard parallel programming techniques— 
this paper introduces one new technique, sloppy coun- 
ters—these bottlenecks can be removed from the kernel 
or avoided by changing the applications slightly. Modify- 
ing the kernel required in total 3002 lines of code changes. 
A speculative conclusion from this analysis is that there 
is no scalability reason to give up on traditional operating 
system organizations just yet. 


1 INTRODUCTION 


There is a sense in the community that traditional kernel 
designs won’t scale well on multicore processors: that 
applications will spend an increasing fraction of their time 
in the kernel as the number of cores increases. Promi- 
nent researchers have advocated rethinking operating sys- 
tems [10, 28, 43] and new kernel designs intended to al- 
low scalability have been proposed (e.g., Barrelfish [11], 
Corey [15], and fos [53]). This paper asks whether tradi- 
tional kernel designs can be used and implemented in a 
way that allows applications to scale. 

This question is difficult to answer conclusively, but
we attempt to shed a small amount of light on it. We
analyze the scaling of a number of system applications on
Linux running on a 48-core machine. We examine
Linux because it has a traditional kernel design, and be- 
cause the Linux community has made great progress in 
making it scalable. The applications include the Exim 
mail server [2], memcached [3], Apache serving static 
files [1], PostgreSQL [4], gmake [23], the Psearchy file 
indexer [35, 48], and a multicore MapReduce library [38]. 
These applications, which we will refer to collectively 
as MOSBENCH, are designed for parallel execution and 
stress many major Linux kernel components. 

Our method for deciding whether the Linux kernel 
design is compatible with application scalability is as 
follows. First we measure scalability of the MOSBENCH 
applications on a recent Linux kernel (2.6.35-rc5, released 
July 12, 2010) with 48 cores, using the in-memory tmpfs 
file system to avoid disk bottlenecks. gmake scales well,
but the other applications scale poorly, performing much
less work per core with 48 cores than with one core. We
attempt to understand and fix the scalability problems, by 
modifying either the applications or the Linux kernel. We 
then iterate, since fixing one scalability problem usually 
exposes further ones. The end result for each applica- 
tion is either good scalability on 48 cores, or attribution 
of non-scalability to a hard-to-fix problem with the ap- 
plication, the Linux kernel, or the underlying hardware. 
The analysis of whether the kernel design is compatible 
with scaling rests on the extent to which our changes to 
the Linux kernel turn out to be modest, and the extent 
to which hard-to-fix problems with the Linux kernel ulti- 
mately limit application scalability. 


As part of the analysis, we fixed three broad kinds of 
scalability problems for MOSBENCH applications: prob- 
lems caused by the Linux kernel implementation, prob- 
lems caused by the applications’ user-level design, and 
problems caused by the way the applications use Linux 
kernel services. Once we identified a bottleneck, it typi- 
cally required little work to remove or avoid it. In some 
cases we modified the application to be more parallel, or 
to use kernel services in a more scalable fashion, and in 
others we modified the kernel. The kernel changes are all 
localized, and typically involve avoiding locks and atomic 
instructions by organizing data structures in a distributed 
fashion to avoid unnecessary sharing. One reason the 
required changes are modest is that stock Linux already 
incorporates many modifications to improve scalability. 
More speculatively, perhaps it is the case that Linux’s 
system-call API is well suited to an implementation that 
avoids unnecessary contention over kernel objects. 


The main contributions of this paper are as follows. 
The first contribution is a set of 16 scalability improve- 
ments to the Linux 2.6.35-rc5 kernel, resulting in what we 
refer to as the patched kernel, PK. A few of the changes 
rely on a new idea, which we call sloppy counters, that 
has the nice property that it can be used to augment shared 
counters to make some uses more scalable without having 
to change all uses of the shared counter. This technique 
is particularly effective in Linux because typically only 
a few uses of a given shared counter are scalability bot- 
tlenecks; sloppy counters allow us to replace just those 
few uses without modifying the many other uses in the 
kernel. The second contribution is a set of application 
benchmarks, MOSBENCH, to measure scalability of operating
systems, which we make publicly available. The
third is a description of the techniques required to im- 
prove the scalability of the MOSBENCH applications. Our 
final contribution is an analysis using MOSBENCH that 
suggests that there is no immediate scalability reason to 
give up on traditional kernel designs. 

The rest of the paper is organized as follows. Section 2 
relates this paper to previous work. Section 3 describes 
the applications in MOSBENCH and what operating sys- 
tem components they stress. Section 4 summarizes the 
differences between the stock and PK kernels. Section 5 
reports on the scalability of MOSBENCH on the stock 
Linux 2.6.35-rc5 kernel and the PK kernel. Section 6 
discusses the implications of the results. Section 7 sum- 
marizes this paper’s conclusions. 


2 RELATED WORK 


There is a long history of work in academia and industry 
to scale Unix-like operating systems on shared-memory 
multiprocessors. Research projects such as the Stanford
FLASH [33] as well as companies such as IBM, Sequent,
SGI, and Sun have produced shared-memory machines
with tens to hundreds of processors running variants
of Unix. Many techniques have been invented to scale
software for these machines, including scalable locking 
(e.g., [41]), wait-free synchronization (e.g., [27]), mul- 
tiprocessor schedulers (e.g., [8, 13, 30, 50]), memory 
management (e.g., [14, 19, 34, 52, 57]), and fast message 
passing using shared memory (e.g., [12, 47]). Textbooks 
have been written about adapting Unix for multiproces- 
sors (e.g., [46]). These techniques have been incorporated 
in current operating systems such as Linux, Mac OS X, 
Solaris, and Windows. Cantrill and Bonwick summarize 
the historical context and real-world experience [17]. 

This paper extends previous scalability studies by ex- 
amining a large set of systems applications, by using a 
48-core PC platform, and by detailing a particular set of 
problems and solutions in the context of Linux. These 
solutions follow the standard parallel programming tech- 
nique of factoring data structures so that each core can 
operate on separate data when sharing is not required, but 
such that cores can share data when necessary. 

Linux scalability improvements. Early multiprocessor
Linux kernels scaled poorly with kernel-intensive parallel
workloads because the kernel used coarse-granularity
locks for simplicity. Since then the Linux community
has redesigned many kernel subsystems to improve
scalability (e.g., Read-Copy-Update (RCU) [39],
local run queues [6], libnuma [31], and improved
load-balancing support [37]). The Linux Symposium
(www.linuxsymposium.org) features papers related to
scalability almost every year. Some of the redesigns are
based on the above-mentioned research, and some companies,
such as IBM and SGI [16], have contributed code
directly. Kleen provides a brief history of Linux kernel
modifications for scaling and reports some areas of poor
scalability in a recent Linux version (2.6.31) [32]. In this
paper, we identify additional kernel scaling problems and
describe how to address them.

Linux scalability studies. Gough et al. study the scalability
of Oracle Database 10g running on Linux 2.6.18
on dual-core Intel Itanium processors [24]. The study
finds problems with the Linux run queue, slab allocator,
and I/O processing. Cui et al. use the TPCC-UVa
and Sysbench-OLTP benchmarks with PostgreSQL to
study the scalability of Linux 2.6.25 on an Intel 8-core
system [56], and find application-internal bottlenecks
as well as poor kernel scalability in System V IPC. We
find that these problems have either been recently fixed
by the Linux community or are a consequence of fixable
problems in PostgreSQL.

Veal and Foong evaluate the scalability of Apache running
on Linux 2.6.20.3 on an 8-core AMD Opteron computer
using SPECweb2005 [51]. They identify Linux scaling
problems in the kernel implementations of scheduling
and directory lookup. On a 48-core computer,
we also observe directory lookup as a scalability
problem, and PK applies a number of techniques to address
this bottleneck. Pesterev et al. identify scalability
problems in the Linux 2.6.30 network code using memcached
and Apache [44]. The PK kernel addresses these
problems by using a modern network card that supports a
large number of virtual queues (similar to the approach
taken by RouteBricks [21]).

Cui et al. describe microbenchmarks for measuring 
multicore scalability and report results from running them 
on Linux on a 32-core machine [55]. They find a number 
of scalability problems in Linux (e.g., memory-mapped 
file creation and deletion). Memory-mapped files show 
up as a scalability problem in one MOSBENCH application 
when multiple threads run in the same address space with 
memory-mapped files. 

A number of new research operating systems use scal- 
ability problems in Linux as motivation. The Corey pa- 
per [15] identified bottlenecks in the Linux file descriptor 
and virtual memory management code caused by unneces- 
sary sharing. Both of these bottlenecks are also triggered 
by MOSBENCH applications. The Barrelfish paper [11] 
observed that Linux TLB shootdown scales poorly. This 
problem is not observed in the MOSBENCH applications. 
Using microbenchmarks, the fos paper [53] finds that the 
physical page allocator in Linux 2.6.24.7 does not scale 
beyond 8 cores and that executing the kernel and applica- 
tions on the same core results in cache interference and 
high miss rates. We find that the page allocator isn’t a 
bottleneck for MOSBENCH applications on 48 cores (even 
though they stress memory allocation), though we have 
reason to believe it would be a problem with more cores.
However, the problem appears to be avoidable by, for 
example, using super-pages or modifying the kernel to 
batch page allocation. 

Solaris scalability studies. Solaris provides a UNIX 
API and runs on SPARC-based and x86-based multi- 
core processors. Solaris incorporates SNZIs [22], which 
are similar to sloppy counters (see section 4.3). Tseng 
et al. report that SAP-SD, IBM Trade and several syn- 
thetic benchmarks scale well on an 8-core SPARC system 
running Solaris 10 [49]. Zou et al. encountered coarse-grained
locks in the UDP networking stack of Solaris
10 that limited scalability of the OpenSER SIP proxy
server on an 8-core SPARC system [29]. Using the microbenchmarks
mentioned above [55], Cui et al. compare
FreeBSD, Linux, and Solaris [54], and find that Linux
scales better on some microbenchmarks and Solaris scales 
better on others. We ran some of the MOSBENCH appli- 
cations on Solaris 10 on the 48-core machine used for 
this paper. While the Solaris license prohibits us from re- 
porting quantitative results, we observed similar or worse 
scaling behavior compared to Linux; however, we don’t 
know the causes or whether Solaris would perform better 
on SPARC hardware. We hope, however, that this paper 
helps others who might analyze Solaris. 


3 THE MOSBENCH APPLICATIONS 


To stress the kernel we chose two sets of applications: 
1) applications that previous work has shown not to 
scale well on Linux (memcached; Apache; and Metis, a 
MapReduce library); and 2) applications that are designed 
for parallel execution and are kernel intensive (gmake, 
PostgreSQL, Exim, and Psearchy). Because many ap- 
plications are bottlenecked by disk writes, we used an 
in-memory tmpfs file system to explore non-disk limita- 
tions. We drive some of the applications with synthetic 
user workloads designed to cause them to use the ker- 
nel intensively, with realism a secondary consideration. 
This collection of applications stresses important parts 
of many kernel components (e.g., the network stack, file 
name cache, page cache, memory manager, process man- 
ager, and scheduler). Most spend a significant fraction 
of their CPU time in the kernel when run on a single 
core. All but one encountered serious scaling problems 
at 48 cores caused by the stock Linux kernel. The rest of 
this section describes the selected applications, how they 
are parallelized, and what kernel services they stress. 


3.1 Mail server 


Exim [2] is a mail server. We operate it in a mode where 
a single master process listens for incoming SMTP con- 
nections via TCP and forks a new process for each con- 
nection, which in turn accepts the incoming mail, queues 
it in a shared set of spool directories, appends it to the
per-user mail file, deletes the spooled mail, and records
the delivery in a shared log file. Each per-connection pro- 
cess also forks twice to deliver each message. With many 
concurrent client connections, Exim has a good deal of 
parallelism. It spends 69% of its time in the kernel on 
a single core, stressing process creation and small file 
creation and deletion. 


3.2 Object cache 


memcached [3] is an in-memory key-value store often 
used to improve web application performance. A single 
memcached server running on multiple cores is bottle- 
necked by an internal lock that protects the key-value hash 
table. To avoid this problem, we run multiple memcached 
servers, each on its own port, and have clients determin- 
istically distribute key lookups among the servers. This 
organization allows the servers to process requests in par- 
allel. When request sizes are small, memcached mainly 
stresses the network stack, spending 80% of its time pro- 
cessing packets in the kernel at one core. 
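
The client-side partitioning described above can be sketched as follows. This is only an illustration under stated assumptions, not the benchmark's actual client code: the function name `server_for_key`, the use of CRC32 as the hash, and the port numbering are all hypothetical choices made for the example.

```python
import zlib
from typing import Sequence


def server_for_key(key: bytes, ports: Sequence[int]) -> int:
    """Deterministically map a key to one of several server ports."""
    # CRC32 is stable across runs and processes, so every client sends
    # a given key to the same server instance.
    return ports[zlib.crc32(key) % len(ports)]


# Hypothetical deployment: one server instance per core, on consecutive ports.
PORTS = [11211 + core for core in range(48)]

if __name__ == "__main__":
    for key in (b"user:1", b"user:2", b"session:abc"):
        print(key, "->", server_for_key(key, PORTS))
```

Because the mapping depends only on the key and the list of ports, no coordination among clients or servers is needed, which is what lets the independent server instances process requests in parallel.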


3.3 Web server


Apache [1] is a popular Web server, which previous work 
(e.g., [51]) has used to study Linux scalability. We run a 
single instance of Apache listening on port 80. We config- 
ure this instance to run one process per core. Each process 
has a thread pool to service connections; one thread is 
dedicated to accepting incoming connections while the 
other threads process the connections. In addition to the 
network stack, this configuration stresses the file system 
(in particular directory name lookup) because it stats and 
opens a file on every request. Running on a single core, 
an Apache process spends 60% of its execution time in 
the kernel. 


3.4 Database 


PostgreSQL [4] is a popular open source SQL database, 
which, unlike many of our other workloads, makes exten- 
sive internal use of shared data structures and synchro- 
nization. PostgreSQL also stresses many shared resources 
in the kernel: it stores database tables as regular files 
accessed concurrently by all PostgreSQL processes, it 
starts one process per connection, it makes use of kernel 
locking interfaces to synchronize and load balance these 
processes, and it communicates with clients over TCP 
sockets that share the network interface. 

Ideally, PostgreSQL would scale well for read-mostly 
workloads, despite its inherent synchronization needs. 
PostgreSQL relies on snapshot isolation, a form of opti- 
mistic concurrency control that avoids most read locks. 
Furthermore, most write operations acquire only row- 
level locks exclusively and acquire all coarser-grained 
locks in shared modes. Thus, in principle, PostgreSQL 
should exhibit little contention for read-mostly workloads. 
In practice, PostgreSQL is limited by bottlenecks in both
its own code and the kernel. For a read-only workload
that avoids most application bottlenecks, PostgreSQL
spends only 1.5% of its time in the kernel with one core,
but this grows to 82% with 48 cores.


3.5 Parallel build 


gmake [23] is an implementation of the standard make 
utility that supports executing independent build rules 
concurrently. gmake is the unofficial default benchmark 
in the Linux community since all developers use it to 
build the Linux kernel. Indeed, many Linux patches 
include comments like “This speeds up compiling the 
kernel.” We benchmarked gmake by building the stock 
Linux 2.6.35-rc5 kernel with the default configuration
for x86_64. gmake creates more processes than there are 
cores, and reads and writes many files. The execution 
time of gmake is dominated by the compiler it runs, but 
system time is not negligible: with one core, 7.6% of the 
execution time is system time. 


3.6 File indexer 


Psearchy is a parallel version of searchy [35, 48], a pro- 
gram to index and query Web pages. We focus on the 
indexing component of searchy because it is more system 
intensive. Our parallel version, pedsort, runs the searchy 
indexer on each core, sharing a work queue of input files. 
Each core operates in two phases. In phase 1, it pulls input 
files off the work queue, reading each file and recording 
the positions of each word in a per-core hash table. When 
the hash table reaches a fixed size limit, it sorts it alpha- 
betically, flushes it to an intermediate index on disk, and 
continues processing input files. Phase 1 is both compute 
intensive (looking up words in the hash table and sorting 
it) and file-system intensive (reading input files and flush- 
ing the hash table). To avoid stragglers in phase 1, the 
initial work queue is sorted so large files are processed 
first. Once the work queue is empty, each core merges 
the intermediate index files it produced, concatenating the 
position lists of words that appear in multiple intermedi- 
ate indexes, and generates a binary file that records the 
positions of each word and a sequence of Berkeley DB 
files that map each word to its byte offset in the binary 
file. To simplify the scalability analysis, each core starts 
a new Berkeley DB every 200,000 entries, eliminating 
a logarithmic factor and making the aggregate work per- 
formed by the indexer constant regardless of the number 
of cores. Unlike phase 1, phase 2 is mostly file-system 
intensive. While pedsort spends only 1.9% of its time 
in the kernel at one core, this grows to 23% at 48 cores, 
indicating scalability limitations. 
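
The per-core work in phase 1 can be sketched roughly as below. This is a simplified illustration, not the pedsort implementation: the names `index_phase1` and `flush` are hypothetical, input files are passed as in-memory strings, the size limit is counted in distinct words, and the limit is checked only after each file rather than continuously.

```python
from collections import defaultdict


def index_phase1(files, table_limit, flush):
    """Index (name, text) pairs, flushing the table once it holds table_limit words."""
    table = defaultdict(list)  # word -> list of (file name, word position)
    for name, text in files:
        for pos, word in enumerate(text.split()):
            table[word].append((name, pos))
        # Simplification: test the size limit once per file.
        if len(table) >= table_limit:
            flush(sorted(table.items()))  # alphabetical order by word
            table = defaultdict(list)
    if table:
        flush(sorted(table.items()))  # final, partially filled table


if __name__ == "__main__":
    runs = []  # stands in for the intermediate index files on disk
    docs = [("a.txt", "linux scales well"), ("b.txt", "linux kernel locks")]
    index_phase1(docs, table_limit=3, flush=runs.append)
    for run in runs:
        print(run)
```

Each flushed run is sorted by word, so the merge in phase 2 can combine the runs and concatenate the position lists of words that appear in more than one of them.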


3.7 MapReduce 


Metis is a MapReduce [20] library for single multicore 
servers inspired by Phoenix [45]. We use Metis with an 
application that generates inverted indices. This workload 
allocates large amounts of memory to hold temporary
tables, stressing the kernel memory allocator and soft page 
fault code. This workload spends 3% of its runtime in the 
kernel with one core, but this rises to 16% at 48 cores. 


4 KERNEL OPTIMIZATIONS 


The MOSBENCH applications trigger a few scalability 
bottlenecks in the kernel. We describe the bottlenecks 
and our solutions here, before presenting detailed per- 
application scaling results in Section 5, because many 
of the bottlenecks are common to multiple applications. 
Figure 1 summarizes the bottlenecks. Some of these problems
have been discussed on the Linux kernel mailing
list and solutions proposed; perhaps the reason these solutions
have not been implemented in the standard kernel is
that the problems are not acute on small-scale SMPs or
are masked by I/O delays in many applications. Figure 1
also summarizes our solution for each bottleneck.


4.1 Scalability tutorial 


Why might one expect performance to scale well with the 
number of cores? If a workload consists of an unlimited 
supply of tasks that do not interact, then you’d expect to 
get linear increases in total throughput by adding cores 
and running tasks in parallel. In real life parallel tasks 
usually interact, and interaction usually forces serial ex- 
ecution. Amdahl’s Law summarizes the result: however 
small the serial portion, it will eventually prevent added 
cores from increasing performance. For example, if 25%
of a program is serial (perhaps inside some global locks),
then any number of cores can provide no more than a
4-times speedup.
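
The bound in this example follows from the usual statement of Amdahl's Law, speedup = 1 / (s + (1 - s) / n) for serial fraction s and n cores. A minimal sketch that evaluates it is given below; the function name `amdahl_speedup` is ours, chosen for illustration.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Upper bound on speedup when serial_fraction of the work cannot run in parallel."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # With a 25% serial portion the bound approaches, but never reaches, 4.
    for n in (1, 2, 8, 48, 1_000_000):
        print(n, round(amdahl_speedup(0.25, n), 3))
```

With a 25% serial portion, 48 cores give a bound of about 3.76, already close to the limit of 4 that no number of cores can exceed.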

Here are a few types of serializing interactions that 
the MOSBENCH applications encountered. These are all 
classic considerations in parallel programming, and are 
discussed in previous work such as [17]. 


• The tasks may lock a shared data structure, so that
increasing the number of cores increases the lock
wait time.

• The tasks may write a shared memory location, so
that increasing the number of cores increases the
time spent waiting for the cache coherence protocol
to fetch the cache line in exclusive mode. This
problem can occur even in lock-free shared data
structures.

• The tasks may compete for space in a limited-size
shared hardware cache, so that increasing the number
of cores increases the cache miss rate. This problem
can occur even if tasks never share memory.

• The tasks may compete for other shared hardware
resources such as inter-core interconnect or DRAM
interfaces, so that additional cores spend their time
waiting for those resources rather than computing.

• There may be too few tasks to keep all cores busy,
so that increasing the number of cores leads to more
idle cores.


Parallel accept: Apache
Concurrent accept system calls contend on shared socket fields. => Use per-core backlog queues for listening sockets.

dentry reference counting: Apache, Exim
File name resolution contends on directory entry reference counts. => Use sloppy counters to reference count directory entry objects.

Mount point (vfsmount) reference counting: Apache, Exim
Walking file name paths contends on mount point reference counts. => Use sloppy counters for mount point objects.

IP packet destination (dst_entry) reference counting: memcached, Apache
IP packet transmission contends on routing table entries. => Use sloppy counters for IP routing table entries.

Protocol memory usage tracking: memcached, Apache
Cores contend on counters for tracking protocol memory consumption. => Use sloppy counters for protocol usage counting.

Acquiring directory entry (dentry) spin locks: Apache, Exim
Walking file name paths contends on per-directory entry spin locks. => Use a lock-free protocol in dlookup for checking filename matches.

Mount point table spin lock: Apache, Exim
Resolving path names to mount points contends on a global spin lock. => Use per-core mount table caches.

Adding files to the open list: Apache, Exim
Cores contend on a per-super block list that tracks open files. => Use per-core open file lists for each super block that has open files.

Allocating DMA buffers: memcached, Apache
DMA memory allocations contend on the memory node 0 spin lock. => Allocate Ethernet device DMA buffers from the local memory node.

False sharing in net_device and device: memcached, Apache, PostgreSQL
False sharing causes contention for read-only structure fields. => Place read-only fields on their own cache lines.

False sharing in page: Exim
False sharing causes contention for read-mostly structure fields. => Place read-only fields on their own cache lines.

inode lists: memcached, Apache
Cores contend on global locks protecting lists used to track inodes. => Avoid acquiring the locks when not necessary.

Dcache lists: memcached, Apache
Cores contend on global locks protecting lists used to track dentrys. => Avoid acquiring the locks when not necessary.

Per-inode mutex: PostgreSQL
Cores contend on a per-inode mutex in lseek. => Use atomic reads to eliminate the need to acquire the mutex.

Super-page fine grained locking: Metis
Super-page soft page faults contend on a per-process mutex. => Protect each super-page memory mapping with its own mutex.

Zeroing super-pages: Metis
Zeroing super-pages flushes the contents of on-chip caches. => Use non-caching instructions to zero the contents of super-pages.

Figure 1: A summary of Linux scalability problems encountered by MOSBENCH applications and their corresponding fixes. The fixes add 2617 lines
of code to Linux and remove 385 lines of code from Linux.


Many scaling problems manifest themselves as delays 
caused by cache misses when a core uses data that other 
cores have written. This is the usual symptom both for 
lock contention and for contention on lock-free mutable 
data. The details depend on the hardware cache coherence 
protocol, but the following is typical. Each core has a 
data cache for its own use. When a core writes data that 
other cores have cached, the cache coherence protocol 
forces the write to wait while the protocol finds the cached 
copies and invalidates them. When a core reads data 
that another core has just written, the cache coherence 
protocol doesn’t return the data until it finds the cache that 
holds the modified data, annotates that cache to indicate 
there is a copy of the data, and fetches the data to the 
reading core. These operations take about the same time 
as loading data from off-chip RAM (hundreds of cycles),
so sharing mutable data can have a disproportionate effect 
on performance. 

Exercising the cache coherence machinery by modify- 
ing shared data can produce two kinds of scaling problems. 
First, the cache coherence protocol serializes modifica- 
tions to the same cache line, which can prevent parallel 
speedup. Second, in extreme cases the protocol may 
saturate the inter-core interconnect, again preventing addi- 
tional cores from providing additional performance. Thus 
good performance and scalability often demand that data 
be structured so that each item of mutable data is used by 
only one core. 

In many cases scaling bottlenecks limit performance 
to some maximum, regardless of the number of cores. In 
other cases total throughput decreases as the number of 
cores grows, because each waiting core slows down the 
cores that are making progress. For example, non-scalable 
spin locks produce per-acquire interconnect traffic that is 
proportional to the number of waiting cores; this traffic 
may slow down the core that holds the lock by an amount 
proportional to the number of waiting cores [41]. Acquiring
a Linux spin lock takes a few cycles if the acquiring
core was the previous lock holder and a few hundred
cycles if another core last held the lock and there is no
contention; under contention, Linux spin locks do not scale.

Performance is often the enemy of scaling. One way 
to achieve scalability is to use inefficient algorithms, so 
that each core busily computes and makes little use of 
shared resources such as locks. Conversely, increasing 
the efficiency of software often makes it less scalable, by 
increasing the fraction of time it uses shared resources. 
This effect occurred many times in our investigations of 
MOSBENCH application scalability. 

Some scaling bottlenecks cannot easily be fixed, be- 
cause the semantics of the shared resource require serial 
access. However, it is often the case that the implementa- 
tion can be changed so that cores do not have to wait for 
each other. For example, in the stock Linux kernel the set 
of runnable threads is partitioned into mostly-private per- 
core scheduling queues; in the common case, each core 
only reads, writes, and locks its own queue [36]. Many 
scaling modifications to Linux follow this general pattern. 

Many of our scaling modifications follow this same 
pattern, avoiding both contention for locks and contention 
for the underlying data. We solved other problems using 
well-known techniques such as lock-free protocols or fine- 
grained locking. In all cases we were able to eliminate 
scaling bottlenecks with only local changes to the kernel 
code. The following subsections explain our techniques. 


4.2 Multicore packet processing 


The Linux network stack connects different stages of 
packet processing with queues. A received packet typ- 
ically passes through multiple queues before finally ar- 
riving at a per-socket queue, from which the application 
reads it with a system call like read or accept. Good 
performance with many cores and many independent net- 
work connections demands that each packet, queue, and 
connection be handled by just one core [21, 42]. This 
avoids inter-core cache misses and queue locking costs. 
Recent Linux kernels take advantage of network cards 
with multiple hardware queues, such as Intel’s 82599 
10Gbit Ethernet (IXGBE) card, or use software tech- 
niques, such as Receive Packet Steering [26] and Receive 
Flow Steering [25], to attempt to achieve this property. 
With a multi-queue card, Linux can be configured to as- 
sign each hardware queue to a different core. Transmit 
scaling is then easy: Linux simply places outgoing pack- 
ets on the hardware queue associated with the current 
core. For incoming packets, such network cards provide
an interface to configure the hardware to enqueue incoming
packets matching particular criteria (e.g., source IP
address and port number) on a specific queue and thus
to a particular core. This spreads packet processing load
across cores. However, the IXGBE driver goes further:
for each core, it samples every 20th outgoing TCP packet
and updates the hardware’s flow directing tables to deliver
further incoming packets from that TCP connection
directly to the core.


This design typically performs well for long-lived con- 
nections, but poorly for short ones. Because the technique 
is based on sampling, it is likely that the majority of 
packets on a given short connection will be misdirected, 
causing cache misses as Linux delivers to the socket on 
one core while the socket is used on another. Furthermore, 
because few packets are received per short-lived connec- 
tion, misdirecting even the initial handshake packet of a 
connection imposes a significant cost. 


For applications like Apache that simultaneously ac- 
cept connections on all cores from the same listening 
socket, we address this problem by allowing the hard- 
ware to determine which core and thus which application 
thread will handle an incoming connection. We modify 
accept to prefer connections delivered to the local core’s 
queue. Then, if the application processes the connection 
on the same core that accepted it (as in Apache), all pro- 
cessing for that connection will remain entirely on one 
core. Our solution has the added benefit of addressing 
contention on the lock that protects the single listening 
socket’s connection backlog queue. 

To implement this, we configured the IXGBE to direct 
each packet to a queue (and thus core) using a hash of the 
packet headers designed to deliver all of a connection’s 
packets (including the TCP handshake packets) to the 
same core. We then modified the code that handles TCP 
connection setup requests to queue requests on a per-core 
backlog queue for the listening socket, so that a thread 
will accept and process connections that the IXGBE di- 
rects to the core running that thread. If accept finds the 
current core’s backlog queue empty, it attempts to steal 
a connection request from a different core’s queue. This 
arrangement provides high performance for short connec- 
tions by processing each connection entirely on one core. 
If threads were to move from core to core while handling 
a single connection, a combination of this technique and 
the current sampling approach might be best. 
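
The arrangement described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not the PK kernel code: the class name ListeningSocket, the CRC-based stand-in for the card's header hash, and the steal order are all hypothetical, and the per-queue locking a real kernel needs is omitted.

```python
import zlib
from collections import deque


class ListeningSocket:
    """Toy model of a listening socket with one backlog queue per core."""

    def __init__(self, num_cores):
        self.backlogs = [deque() for _ in range(num_cores)]

    def core_for(self, src_ip, src_port, dst_ip, dst_port):
        # Stand-in for the network card's header hash (an assumption):
        # every packet of a connection carries the same 4-tuple, so the
        # whole connection, handshake included, maps to one core.
        key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
        return zlib.crc32(key) % len(self.backlogs)

    def enqueue(self, conn):
        # Connection setup: queue the request on the backlog of the core
        # that the hash selected, and report which core that was.
        core = self.core_for(*conn)
        self.backlogs[core].append(conn)
        return core

    def accept(self, core):
        # Prefer a connection that was delivered to the local core.
        if self.backlogs[core]:
            return self.backlogs[core].popleft(), core
        # Local backlog is empty: steal a request from another core.
        for other, queue in enumerate(self.backlogs):
            if queue:
                return queue.popleft(), other
        return None, None
```

In this sketch a thread calling accept(core) touches only its own queue in the common case; the scan over other queues happens only when the local backlog is empty.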


4.3 Sloppy counters 


Linux uses shared counters for reference-counted garbage 
collection and to manage various resources. These coun- 
ters can become bottlenecks if many cores update them. 
In these cases lock-free atomic increment and decrement 
instructions do not help, because the coherence hardware 
serializes the operations on a given counter. 

The MOSBENCH applications encountered bottlenecks
from reference counts on directory entry objects
(dentrys), mounted file system objects (vfsmounts), network
routing table entries (dst_entrys), and counters
tracking the amount of memory allocated by each network
protocol (such as TCP or UDP).

Figure 2: An example of the kernel using a sloppy counter for dentry
reference counting. A large circle represents a local counter, and a gray
dot represents a held reference. In this figure, a thread on core 0 first
acquires a reference from the central counter. When the thread releases
this reference, it adds the reference to the local counter. Finally, another
thread on core 0 is able to acquire the spare reference without touching
the central counter.

Our solution, which we call sloppy counters, builds on 
the intuition that each core can hold a few spare references 
to an object, in hopes that it can give ownership of these 
references to threads running on that core, without having 
to modify the global reference count. More concretely, 
a sloppy counter represents one logical counter as a sin- 
gle shared central counter and a set of per-core counts 
of spare references. When a core increments a sloppy 
counter by V, it first tries to acquire a spare reference 
by decrementing its per-core counter by V. If the per- 
core counter is greater than or equal to V, meaning there 
are sufficient local references, the decrement succeeds. 
Otherwise the core must acquire the references from the 
central counter, so it increments the shared counter by 
V. When a core decrements a sloppy counter by V, it 
releases these references as local spare references, incre- 
menting its per-core counter by V. Figure 2 illustrates 
incrementing and decrementing a sloppy counter. If the 
local count grows above some threshold, spare references 
are released by decrementing both the per-core count and 
the central count. 

Sloppy counters maintain the invariant that the sum 
of per-core counters and the number of resources in use 
equals the value in the shared counter. For example, a 
shared dentry reference counter equals the sum of the 
per-core counters and the number of references to the 
dentry currently in use. 

A core usually updates a sloppy counter by modifying 
its per-core counter, an operation which typically only 
needs to touch data in the core’s local cache (no waiting 
for locks or cache-coherence serialization). 
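
The increment, decrement, and invariant described above can be sketched as follows. This is a single-threaded Python model for illustration, not the kernel implementation: the class name, the threshold value, and the choice to release spares down to the threshold are assumptions, since the text only says that spares are released once the local count exceeds some threshold.

```python
class SloppyCounter:
    """One logical counter: a shared central count plus per-core spares."""

    def __init__(self, num_cores, threshold=8):
        self.central = 0                 # shared counter (contended in a kernel)
        self.spare = [0] * num_cores     # per-core counts of spare references
        self.threshold = threshold       # hypothetical release threshold

    def increment(self, core, v=1):
        if self.spare[core] >= v:
            # Enough local spares: take them without touching shared state.
            self.spare[core] -= v
        else:
            # Otherwise acquire the references from the central counter.
            self.central += v

    def decrement(self, core, v=1):
        # Released references become local spares.
        self.spare[core] += v
        if self.spare[core] > self.threshold:
            # Too many spares: hand the excess back by decrementing both
            # the per-core count and the central count (assumed policy).
            excess = self.spare[core] - self.threshold
            self.spare[core] -= excess
            self.central -= excess

    def in_use(self):
        # Reconciliation: central == sum(spares) + references in use.
        # This touches every core's counter, so it is the expensive path.
        return self.central - sum(self.spare)
```

The in_use method is the reconciliation step; it reads every per-core counter, which is why the true value is costly to obtain.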

We added sloppy counters to count references to 
dentrys, vfsmounts, and dst_entrys, and used sloppy 
counters to track the amount of memory allocated by 
each network protocol (such as TCP and UDP). Only
uses of a counter that cause contention need to be modified,
since sloppy counters are backwards-compatible
with existing shared-counter code. The kernel code that 
creates a sloppy counter allocates the per-core counters. 
It is occasionally necessary to reconcile the central and 
per-core counters, for example when deciding whether an 
object can be de-allocated. This operation is expensive, 
so sloppy counters should only be used for objects that 
are relatively infrequently de-allocated. 

Sloppy counters are similar to Scalable NonZero Indicators
(SNZI) [22], distributed counters [9], and approximate
counters [5]. All of these techniques speed up increment/decrement
by use of per-core counters, and require
significantly more work to find the true total value. Sloppy 
counters are attractive when one wishes to improve the 
performance of some uses of an existing counter without 
having to modify all points in the code where the counter 
is used. A limitation of sloppy counters is that they use 
space proportional to the number of cores. 


4.4 Lock-free comparison 


We found situations in which MOSBENCH applications 
were bottlenecked by low scalability for name lookups 
in the directory entry cache. The directory entry cache 
speeds up lookups by mapping a directory and a file name 
to a dentry identifying the target file’s inode. When 
a potential dentry is located, the lookup code acquires 
a per-dentry spin lock to atomically compare several 
fields of the dentry with the arguments of the lookup 
function. Even though the directory cache has been op- 
timized using RCU for scalability [40], the dentry spin 
lock for common parent directories, such as /usr, was 
sometimes a bottleneck even if the path names ultimately 
referred to different files. 

We optimized dentry comparisons using a lock-free
protocol similar to Linux’s lock-free page cache lookup
protocol [18]. The lock-free protocol uses a generation
counter, which the PK kernel increments after every modification
to a directory entry (e.g., mv foo bar). During
a modification (when the dentry spin lock is held), PK
temporarily sets the generation counter to 0. The PK kernel
compares dentry fields to the arguments using the
following procedure for atomicity:


• If the generation counter is 0, fall back to the locking
protocol. Otherwise remember the value of the
generation counter.

• Copy the fields of the dentry to local variables. If
the generation afterwards differs from the remembered
value, fall back to the locking protocol.

• Compare the copied fields to the arguments. If there
is a match, increment the reference count unless it is
0, and return the dentry. If the reference count is 0,
fall back to the locking protocol.

The lock-free protocol improves scalability because it
allows cores to perform lookups for the same directory
entries without serializing.
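
The three steps above can be sketched as follows. This is an illustrative Python model, not the PK code: the Dentry class, its field names, and the behavior of the locked fallback are assumptions, and the plain refcount update stands in for what would be an atomic increment-unless-zero in a kernel.

```python
import threading


class Dentry:
    def __init__(self, parent, name):
        self.lock = threading.Lock()  # stands in for the per-dentry spin lock
        self.generation = 1           # 0 means "modification in progress"
        self.parent = parent
        self.name = name
        self.refcount = 1

    def rename(self, new_name):
        with self.lock:
            saved = self.generation
            self.generation = 0       # push concurrent lookups to the slow path
            self.name = new_name
            self.generation = saved + 1


def compare_locked(d, parent, name):
    # Fallback: the original locking protocol (simplified).
    with d.lock:
        if d.parent == parent and d.name == name:
            d.refcount += 1
            return d
        return None


def compare_lock_free(d, parent, name):
    gen = d.generation
    if gen == 0:                              # step 1: writer active
        return compare_locked(d, parent, name)
    d_parent, d_name = d.parent, d.name       # step 2: copy the fields
    if d.generation != gen:                   # fields may be inconsistent
        return compare_locked(d, parent, name)
    if d_parent != parent or d_name != name:  # step 3: compare the copies
        return None
    if d.refcount == 0:
        return compare_locked(d, parent, name)
    d.refcount += 1  # a kernel would use an atomic increment-unless-zero
    return d
```

In the common case a lookup reads the generation counter twice and never writes the lock word, so lookups of a shared parent such as /usr would not serialize on the spin lock.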


4.5 Per-core data structures 


We encountered three kernel data structures that caused 
scaling bottlenecks due to lock contention: a per-super- 
block list of open files that determines whether a read- 
write file system can be remounted read-only, a table of 
mount points used during path lookup, and the pool of 
free packet buffers. Though each of these bottlenecks is
caused by lock contention, bottlenecks would remain if
we replaced the locks with finer-grained locks or a lock-free
protocol, because multiple cores update the data structures.
Therefore our solutions refactor the data structures
so that in the common case each core uses different data.

We split the per-super-block list of open files into per- 
core lists. When a process opens a file the kernel locks 
the current core’s list and adds the file. In most cases 
a process closes the file on the same core it opened it 
on. However, the process might have migrated to another 
core, in which case the file must be expensively removed 
from the list of the original core. When the kernel checks 
if a file system can be remounted read-only it must lock 
and scan all cores’ lists. 
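
A minimal sketch of the per-core open-file lists, assuming a simplified remount check (no file open for writing) and hypothetical names throughout; it is meant to show where the cheap and expensive paths lie, not to mirror the kernel's data structures.

```python
import threading


class SuperBlock:
    """Per-super-block open-file tracking, split into one list per core."""

    def __init__(self, num_cores):
        self.locks = [threading.Lock() for _ in range(num_cores)]
        self.open_files = [set() for _ in range(num_cores)]

    def open(self, core, f):
        # Common case: lock and update only the current core's list.
        with self.locks[core]:
            self.open_files[core].add(f)
        return core  # remembered so close() can find the right list

    def close(self, home_core, f):
        # If the process migrated since open(), home_core is a remote
        # core and this access is the expensive case.
        with self.locks[home_core]:
            self.open_files[home_core].remove(f)

    def can_remount_read_only(self):
        # Rare operation: must lock and scan every core's list.
        # Files are modeled as (name, writable) pairs (an assumption).
        for lock, files in zip(self.locks, self.open_files):
            with lock:
                if any(writable for _name, writable in files):
                    return False
        return True
```

The design trades a slower remount check, which is rare, for open and close operations that usually stay on one core.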

We also added per-core vfsmount tables, each acting 
as a cache for a central vfsmount table. When the kernel 
needs to look up the vfsmount for a path, it first looks in 
the current core’s table, then the central table. If the latter 
succeeds, the result is added to the per-core table. 
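
The lookup order can be sketched as a two-level table. The names below are hypothetical, the central_lookups field exists only to make the effect visible, and the sketch omits invalidation of per-core entries on unmount, which the text does not describe.

```python
class MountTable:
    """Central vfsmount table fronted by one cache per core."""

    def __init__(self, num_cores, mounts):
        self.central = dict(mounts)
        self.per_core = [dict() for _ in range(num_cores)]
        self.central_lookups = 0  # counts trips to the shared table

    def lookup(self, core, path):
        cache = self.per_core[core]
        if path in cache:
            return cache[path]        # hit: no shared state touched
        self.central_lookups += 1     # miss: a kernel would take the
        mnt = self.central.get(path)  # table's spin lock here
        if mnt is not None:
            cache[path] = mnt         # remember the result locally
        return mnt
```

After the first lookup of a path on a given core, later lookups on that core are served from the per-core table.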

Finally, the default Linux policy for machines with 
NUMA memory is to allocate packet buffers (skbuffs) 
from a single free list in the memory system closest to the 
I/O bus. This caused contention for the lock protecting 
the free list. We solved this using per-core free lists. 


4.6 Eliminating false sharing 


We found some MOSBENCH applications caused false 
sharing in the kernel. In the cases we identified, the ker- 
nel located a variable it updated often on the same cache 
line as a variable it read often. The result was that cores 
contended for the falsely shared line, limiting scalabil- 
ity. Exim per-core performance degraded because of false 
sharing of physical page reference counts and flags, which 
the kernel located on the same cache line of a page vari- 
able. memcached, Apache, and PostgreSQL faced simi- 
lar false sharing problems with net_device and device 
variables. In all cases, placing the heavily modified data 
on a separate cache line improved scalability. 
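
The layout change can be illustrated with Python's ctypes module, which lays out structures using C rules. The structures and field names below are hypothetical, the 64-byte line size is an assumption, and the helper assumes the structure itself begins on a cache-line boundary.

```python
import ctypes

CACHE_LINE = 64  # assumed cache-line size in bytes


class Shared(ctypes.Structure):
    # A frequently written counter placed next to a read-mostly field.
    _fields_ = [
        ("hot_counter", ctypes.c_uint32),
        ("read_mostly", ctypes.c_uint32),
    ]


class Separated(ctypes.Structure):
    # Padding pushes the read-mostly field onto the next cache line.
    _fields_ = [
        ("hot_counter", ctypes.c_uint32),
        ("_pad", ctypes.c_uint8 * (CACHE_LINE - ctypes.sizeof(ctypes.c_uint32))),
        ("read_mostly", ctypes.c_uint32),
    ]


def cache_line_of(struct_type, field):
    # Index of the cache line that holds `field`, assuming the
    # structure starts on a cache-line boundary.
    return getattr(struct_type, field).offset // CACHE_LINE
```

In Shared both fields fall on line 0, so every write to hot_counter would invalidate other cores' cached copies of read_mostly; in Separated the two fields fall on different lines.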


4.7 Avoiding unnecessary locking 


For small numbers of cores, lock contention in Linux
does not limit scalability for MOSBENCH applications.
With more than 16 cores, the scalability of memcached,
Apache, PostgreSQL, and Metis is limited by waiting for
and acquiring spin locks and mutexes¹ in the file system
and virtual memory management code. In many cases we
were able to eliminate acquisitions of the locks altogether
by modifying the code to detect special cases when acquiring
the locks was unnecessary. In one case, we split
a mutex protecting all the super page mappings into one
mutex per mapping.

Figure 3: MOSBENCH results summary. Each bar shows the ratio of
per-core throughput with 48 cores to throughput on one core, with 1.0
indicating perfect scalability. Each pair of bars corresponds to one
application before and after our kernel and application modifications.


5 EVALUATION 


This section evaluates the MOSBENCH applications on 
the most recent Linux kernel at the time of writing 
(Linux 2.6.35-rc5, released on July 12, 2010) and our
modified version of this kernel, PK. For each applica- 
tion, we describe how the stock kernel limits scalability, 
and how we addressed the bottlenecks by modifying the 
application and taking advantage of the PK changes. 

Figure 3 summarizes the results of the MOSBENCH 
benchmark, comparing application scalability before and 
after our modifications. A bar with height 1.0 indicates 
perfect scalability (48 cores yielding a speedup of 48). 
Most of the applications scale significantly better with 
our modifications. All of them fall short of perfect scal- 
ability even with those modifications. As the rest of this 
section explains, the remaining scalability bottlenecks are 
not the fault of the kernel. Instead, they are caused by 
non-parallelizable components in the application or un- 
derlying hardware: resources that the application’s design 
requires it to share, imperfect load balance, or hardware 
bottlenecks such as the memory system or the network 
card. For this reason, we conclude that the Linux ker- 
nel with our modifications is consistent with MOSBENCH 
scalability up to 48 cores. 
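
The bar height used in Figure 3 can be written as a one-line function. The function name and the throughput values in the usage note are made up for illustration; they are not measurements from this paper.

```python
def per_core_scalability(throughput_1, throughput_n, n):
    """Ratio of per-core throughput on n cores to throughput on one core.

    1.0 means perfect scalability: n cores deliver n times the
    single-core throughput.
    """
    return (throughput_n / n) / throughput_1
```

For example, with a hypothetical single-core throughput of 100 requests per second, 4800 requests per second on 48 cores gives a ratio of 1.0, while 2400 requests per second gives 0.5.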

For each application we show scalability plots in the
same format, which shows throughput per core (see, for
example, Figure 4). A horizontal line indicates perfect
scalability: each core contributes the same amount of
work regardless of the total number of cores. In practice
one cannot expect a truly horizontal line: a single core
usually performs disproportionately well because there
is no inter-core sharing and because Linux uses a streamlined
lock scheme with just one core, and the per-chip
caches become less effective as more active cores share
them. For most applications we see the stock kernel’s line
drop sharply because of kernel bottlenecks, and the PK
line drop more modestly.

¹ A thread initially busy waits to acquire a mutex, but if the wait time
is long the thread yields the CPU.


5.1 Method 


We run the applications that modify files on a tmpfs in- 
memory file system to avoid waiting for disk I/O. The 
result is that MOSBENCH stresses the kernel more than it would
if it had to wait for the disk, but that the results are not
representative of how the applications would perform 
in a real deployment. For example, a real mail server 
would probably be bottlenecked by the need to write each 
message durably to a hard disk. The purpose of these 
experiments is to evaluate the Linux kernel’s multicore 
performance, using the applications to generate a reason- 
ably realistic mix of system calls. 

We run experiments on a 48-core machine, with a Tyan
Thunder S4985 board and an M4985 quad CPU daughterboard.
The machine has a total of eight 2.4 GHz 6-core
AMD Opteron 8431 chips. Each core has private 64 Kbyte 
instruction and data caches, and a 512 Kbyte private L2 
cache. The cores on each chip share a 6 Mbyte L3 cache, 
1 Mbyte of which is used for the HT Assist probe fil- 
ter [7]. Each chip has 8 Gbyte of local off-chip DRAM. 
A core can access its L1 cache in 3 cycles, its L2 cache in 
14 cycles, and the shared on-chip L3 cache in 28 cycles. 
DRAM access latencies vary, from 122 cycles for a core 
to read from its local DRAM to 503 cycles for a core to 
read from the DRAM of the chip farthest from it on the 
interconnect. The machine has a dual-port Intel 82599 
10Gbit Ethernet (IXGBE) card, though we use only one 
port for all experiments. That port connects to an Ethernet 
switch with a set of load-generating client machines. 

Experiments that use fewer than 48 cores run with 
the other cores entirely disabled. memcached, Apache, 
Psearchy, and Metis pin threads to cores; the other ap- 
plications do not. We run each experiment 3 times and 
show the best throughput, in order to filter out unrelated 
activity; we found the variation to be small. 


5.2 Exim


To measure the performance of Exim 4.71, we configure 
Exim to use tmpfs for all mutable files—spool files, log 
files, and user mail files—and disable DNS and RFC1413 
lookups. Clients run on the same machine as Exim. Each 
repeatedly opens an SMTP connection to Exim, sends 10 
separate 20-byte messages to a local user, and closes the 
SMTP connection. Sending 10 messages per connection
prevents exhaustion of TCP client port numbers. Each
client sends to a different user to prevent contention on
user mail files. We use 96 client processes regardless of
the number of active cores; as long as there are enough
clients to keep Exim busy, the number of clients has little
effect on performance.

Figure 4: Exim throughput and runtime breakdown.

We modified and configured Exim to increase perfor- 
mance on both the stock and PK kernels: 


• Berkeley DB v4.6 reads /proc/stat to find the number
of cores. This consumed about 20% of the total runtime,
so we modified Berkeley DB to aggressively
cache this information.

• We configured Exim to split incoming queued messages
across 62 spool directories, hashing by the
per-connection process ID. This improves scalability
because delivery processes are less likely to
create files in the same directory, which decreases
contention on the directory metadata in the kernel.

• We configured Exim to avoid an exec() per mail
message, using deliver_drop_privilege.


Figure 4 shows the number of messages Exim can pro- 
cess per second on each core, as the number of cores 
varies. The stock and PK kernels perform nearly the 
same on one core. As the number of cores increases, the 
per-core throughput of the stock kernel eventually drops 
toward zero. The primary cause of the throughput drop 
is contention on a non-scalable kernel spin lock that se- 
rializes access to the vfsmount table. Exim causes the 
kernel to access the vfsmount table dozens of times for 
each message. Exim on PK scales significantly better, 
owing primarily to improvements to the vfsmount ta- 
ble (Section 4.5) and the changes to the dentry cache 
(Section 4.4). 

Throughput on the PK kernel degrades from one to 
two cores, while the system time increases, because of 
the many kernel data structures that are not shared with 
one core but must be shared (with cache misses) with 
two cores. The throughput on the PK kernel continues
to degrade; however, this is mainly due to application- 
induced contention on the per-directory locks protecting 
file creation in the spool directories. As the number of 
cores increases, there is an increasing probability that 
Exim processes running on different cores will choose the 
same spool directory, resulting in the observed contention. 
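
A rough estimate shows how quickly this probability grows. The sketch below assumes, purely for illustration, that each concurrently running process picks one of the 62 spool directories uniformly and independently; real process IDs are not random, so this is an approximation and not a model of Exim's actual hashing.

```python
def p_shared_directory(processes, directories=62):
    """Probability that at least two processes pick the same directory,
    assuming uniform, independent choices (an approximation)."""
    if processes > directories:
        return 1.0  # pigeonhole: some directory must be shared
    p_all_distinct = 1.0
    for i in range(processes):
        # The (i+1)-th process must avoid the i directories already taken.
        p_all_distinct *= (directories - i) / directories
    return 1.0 - p_all_distinct
```

Under these assumptions two concurrent processes collide with probability 1/62 (about 1.6%), while with 48 concurrent processes a shared directory is almost certain.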

We foresee a potential bottleneck on more cores due 
to cache misses when a per-connection process and the 
delivery process it forks run on different cores. When 
this happens the delivery process suffers cache misses
when it first accesses kernel data—especially data related 
to virtual address mappings—that its parent initialized. 
The result is that process destruction, which frees virtual 
address mappings, and soft page fault handling, which 
reads virtual address mappings, execute more slowly with 
more cores. For the Exim configuration we use, however,
this slowdown is negligible compared to the slowdown that
results from contention on spool directories.


5.3 memcached


We run a separate memcached 1.4.4 process on each 
core to avoid application lock contention. Each server is 
pinned to a separate core and has its own UDP port. Each 
client thread repeatedly queries a particular memcached 
instance for a non-existent key because this places higher 
load on the kernel than querying for existing keys. There 
are a total of 792 client threads running on 22 client 
machines. Requests are 68 bytes, and responses are 64. 
Each client thread sends a batch of 20 requests and waits 
for the responses, timing out after 100 ms in case packets 
are lost. 

For both kernels, we use a separate hardware receive 
and transmit queue for each core and configure the 
IXGBE to inspect the port number in each incoming 
packet header, place the packet on the queue dedicated to 
the associated memcached’s core, and deliver the receive 
interrupt to that core. 

Figure 5 shows that memcached does not scale well on
the stock Linux kernel.

Figure 6: Apache throughput and runtime breakdown.


One scaling problem occurs in the memory allocator. 
Linux associates a separate allocator with each socket to 
allocate memory from that chip’s attached DRAM. The 
stock kernel allocates each packet from the socket nearest 
the PCI bus, resulting in contention on that socket’s allo- 
cator. We modified the allocation policy to allocate from 
the local socket, which improved throughput by ~30%. 

Another bottleneck was false read/write sharing of 
IXGBE device driver data in the net_device and 
device structures, resulting in cache misses for all cores 
even on read-only fields. We rearranged both structures 
to isolate critical read-only members to their own cache 
lines. Removing a single falsely shared cache line in 
net_device increased throughput by 30% at 48 cores. 

The final bottleneck was contention on the dst_entry 
structure’s reference count in the network stack’s destina- 
tion cache, which we replaced with a sloppy counter (see 
Section 4.3). 

The “PK” line in Figure 5 shows the scalability of 
memcached with these changes. The per-core throughput
drops off after 16 cores. We have isolated this bottleneck 
to the IXGBE card itself, which appears to handle fewer 
packets as the number of virtual queues increases. As a 
result, it fails to transmit packets at line rate even though 
there are always packets queued in the DMA rings. 

To summarize, while memcached scales poorly, the 
bottlenecks caused by the Linux kernel were fixable and 
the remaining bottleneck lies in the hardware rather than 
in the Linux kernel. 


5.4 Apache 


A single instance of Apache running on stock Linux scales 
very poorly because of contention on a mutex protecting 
the single accept socket. Thus, for stock Linux, we run 
a separate instance of Apache per core with each server 
running on a distinct port. Figure 6 shows that Apache 
still scales poorly on the stock kernel, even with separate 
Apache instances. 

For PK, we run a single instance of Apache 2.2.14 on 
one TCP port. Apache serves a single static file from an 
ext3 file system; the file resides in the kernel buffer cache.
We serve a file that is 300 bytes because transmitting a 
larger file exhausts the available 10 Gbit bandwidth at a 
low server core count. Each request involves accepting a 
TCP connection, opening the file, copying its content to a 
socket, and closing the file and socket; logging is disabled. 
We use 58 client processes running on 25 physical client 
machines (many clients are themselves multi-core). For 
each active server core, each client opens 2 TCP connec- 
tions to the server at a time (so, for a 48-core server, each 
client opens 96 TCP connections). 

All the problems and solutions described in Section 5.3 
apply to Apache, as do the modifications to the dentry 
cache for both files and sockets described in Section 4. 
Apache forks off a process per core, pinning each new pro- 
cess to a different core. Each process dedicates a thread 
to accepting connections from the shared listening socket 
and thus, with the accept queue changes described in Sec- 
tion 4.2, each connection is accepted on the core it initially 
arrives on and all packet processing is performed local to 
that core. The PK numbers in Figure 6 are significantly 
better than Apache running on the stock kernel; however, 
Apache’s throughput on PK does not scale linearly. 

Past 36 cores, performance degrades because the net- 
work card cannot keep up with the increasing workload. 
Lack of work causes the server idle time to reach 18% at 
48 cores. At 48 cores, the network card’s internal diagnos- 
tic counters show that the card’s internal receive packet 
FIFO overflows. These overflows occur even though the 
clients are sending a total of only 2 Gbits and 2.8 million 
packets per second when other independent tests have 
shown that the card can either receive upwards of 4 Gbits 
per second or process 5 million packets per second. 

We created a microbenchmark that replicates the 
Apache network workload, but uses substantially less 
CPU time on the server. In the benchmark, the client ma- 
chines send UDP packets as fast as possible to the server, 
which also responds with UDP packets. The packet mix 
is similar to that of the Apache benchmark. While the mi- 
crobenchmark generates far more packets than the Apache 
clients, the network card ultimately delivers a similar num- 
ber of packets per second as in the Apache benchmark 
and drops the rest. Thus, at high core counts, the network 
card is unable to deliver additional load to Apache, which 
limits its scalability. 


5.5 PostgreSQL 


We evaluate Linux’s scalability running PostgreSQL 8.3.9 
using both a 100% read workload and a 95%/5% 
read/write workload. The database consists of a sin- 
gle indexed 600 Mbyte table of 10,000,000 key-value 
pairs stored in tmpfs. We configure PostgreSQL to use 
a 2 Gbyte application-level cache because PostgreSQL
protects its cache free-list with a single lock and thus
scales poorly with smaller caches. While we do not pin
the PostgreSQL processes to cores, we do rely on the
IXGBE driver to route packets from long-lived connections
directly to the cores processing those connections.

Figure 7: PostgreSQL read-only workload throughput and runtime
breakdown.

Figure 8: PostgreSQL read/write workload throughput and runtime
breakdown.

Our workload generator simulates typical high- 
performance PostgreSQL configurations, where middle- 
ware on the client machines aggregates multiple client 
connections into a small number of connections to the 
server. Our workload creates one PostgreSQL connection 
per server core and sends queries (selects or updates) in 
batches of 256, aggregating successive read-only transac- 
tions into single transactions. This workload is intended to 
minimize application-level contention within PostgreSQL 
in order to maximize the stress PostgreSQL places on the 
kernel. 

The “Stock” line in Figures 7 and 8 shows that Post- 
greSQL has poor scalability on the stock kernel. The first 
bottleneck we encountered, which caused the read/write 
workload’s total throughput to peak at only 28 cores, was 
due to PostgreSQL’s design. PostgreSQL implements 
row- and table-level locks atop user-level mutexes; as 
a result, even a non-conflicting row- or table-level lock 
acquisition requires exclusively locking one of only 16 
global mutexes. This leads to unnecessary contention for
non-conflicting acquisitions of the same lock (as seen in
the read/write workload) and to false contention between
unrelated locks that hash to the same exclusive mutex. We 
address this problem by rewriting PostgreSQL’s row- and 
table-level lock manager and its mutexes to be lock-free 
in the uncontended case, and by increasing the number of 
mutexes from 16 to 1024. 
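
The effect of the mutex count on false contention can be illustrated with a small sketch. This is not PostgreSQL's lock manager: the lock names, the use of CRC32 as the hash, and the helper names are assumptions made purely for illustration. The sketch maps a set of unrelated lock tags onto a fixed number of mutexes and counts how many tags end up sharing a mutex with at least one other tag.

```python
import zlib
from collections import Counter

def partition_for(lock_tag: str, num_mutexes: int) -> int:
    """Map a lock tag to the index of the mutex guarding it (hypothetical hash)."""
    return zlib.crc32(lock_tag.encode()) % num_mutexes

def tags_sharing_a_mutex(lock_tags, num_mutexes: int) -> int:
    """Count tags whose mutex is also used by at least one other tag."""
    load = Counter(partition_for(tag, num_mutexes) for tag in lock_tags)
    return sum(count for count in load.values() if count > 1)

# 48 unrelated row locks, one per core (hypothetical names).
tags = [f"table_{i}/row_{i}" for i in range(48)]
shared_16 = tags_sharing_a_mutex(tags, 16)
shared_1024 = tags_sharing_a_mutex(tags, 1024)
print(shared_16, shared_1024)
```

With only 16 mutexes, at least 33 of the 48 unrelated locks must share a mutex with another lock whatever the hash function (pigeonhole principle); with 1024 mutexes, sharing becomes possible rather than unavoidable.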

The “Stock + mod PG” line in Figures 7 and 8 shows 
the results of this modification, demonstrating improved 
performance out to 36 cores for the read/write workload. 
While performance still collapses at high core counts, 
the cause of this has shifted from excessive user time to 
excessive system time. The read-only workload is largely 
unaffected by the modification as it makes little use of 
row- and table-level locks. 

With modified PostgreSQL on stock Linux, throughput
for both workloads collapses at 36 cores, with system
time rising from 1.7 µseconds/query at 32 cores to
322 µseconds/query at 48 cores. The main reason is the
kernel’s lseek implementation. PostgreSQL calls lseek
many times per query on the same two files, which in turn 
acquires a mutex on the corresponding inode. Linux’s 
adaptive mutex implementation suffers from starvation 
under intense contention, resulting in poor performance. 
However, the mutex acquisition turns out not to be neces- 
sary, and PK eliminates it. 
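
The call pattern can be sketched from user space as follows. The sketch assumes the calls take the form lseek(fd, 0, SEEK_END), which returns the file's size; the text above does not state the arguments, and the temporary file merely stands in for a table file. It shows only the pattern of repeated calls on one file, not the kernel-side locking.

```python
import os
import tempfile

def size_via_lseek(fd: int) -> int:
    """Seek to the end of the file and return the resulting offset (the file size)."""
    return os.lseek(fd, 0, os.SEEK_END)

# Hypothetical stand-in for a table file: 8192 bytes.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"\0" * 8192)
    # Repeated size queries on the same file, as a query executor might issue.
    sizes = [size_via_lseek(fd) for _ in range(1000)]
finally:
    os.close(fd)
    os.unlink(path)

print(set(sizes))
```

Each call is a system call that touches the same inode, so with one such loop per core every core ends up contending for whatever lock the kernel takes on that inode.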

Figures 7 and 8 show that, with PK’s modified lseek
and smaller contributions from other PK changes, Post- 
greSQL performance no longer collapses. On PK, Post- 
greSQL’s overall scalability is primarily limited by con- 
tention for the spin lock protecting the buffer cache page 
for the root of the table index. It spends little time in the 
kernel, and is not limited by Linux’s performance. 


5.6 gmake 


We measure the performance of parallel gmake by build- 
ing the object files of Linux 2.6.35-rc5 for x86_64. All 
input source files reside in the buffer cache, and the output 
files are written to tmpfs. We set the maximum number 
of concurrent jobs of gmake to twice the number of cores. 

Figure 9 shows that gmake on 48 cores achieves ex- 
cellent scalability, running 35 times faster on 48 cores 
than on one core for both the stock and PK kernels. The 
PK kernel shows slightly lower system time owing to the 
changes to the dentry cache. gmake scales imperfectly 
because of serial stages at the beginning of the build and 
straggling processes at the end. 
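
As a rough check on the 35x figure, Amdahl's law can be inverted to estimate the fraction of the build that behaves as if it were serial. Applying Amdahl's law here is an assumption of this sketch rather than an analysis drawn from the measurements: straggling processes are not strictly serial work, and the function names are illustrative.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Amdahl's law: speedup = 1 / (s + (1 - s) / n)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

def amdahl_serial_fraction(speedup: float, cores: int) -> float:
    """Invert Amdahl's law: s = (n / speedup - 1) / (n - 1)."""
    return (cores / speedup - 1.0) / (cores - 1.0)

s = amdahl_serial_fraction(35.0, 48)
print(f"implied serial fraction: {s:.4f}")
```

Under this model, a 35x speedup on 48 cores corresponds to a serial fraction of roughly 0.8%.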

gmake scales so well in part because much of the CPU 
time is in the compiler, which runs independently on 
each core. In addition, Linux kernel developers have 
thoroughly optimized kernel compilation, since it is of 
particular importance to them. 


5.7 Psearchy/pedsort 


Figure 10 shows the runtime for different versions of
pedsort indexing the Linux 2.6.35-rc5 source tree, which
consists of 368 Mbyte of text across 33,312 source files.
The input files are in the buffer cache and the output
files are written to tmpfs. Each core uses a 48 Mbyte
word hash table and limits the size of each output index
to 200,000 entries (see Section 3.6). As a result, the
total work performed by pedsort and its final output are
independent of the number of cores involved.

Figure 9: gmake throughput and runtime breakdown.

Figure 10: pedsort throughput and runtime breakdown.

The initial version of pedsort used a single process with 
one thread per core. The line marked “Stock + Threads” in 
Figure 10 shows that it scales badly. Most of the increase 
in runtime is in system time: for 1 core the system time
is 2.3 seconds, while at 48 cores the total system time is 
41 seconds. 

Threaded pedsort scales poorly because a per-process 
kernel mutex serializes calls to mmap and munmap for a 
process’ virtual address space. pedsort reads input files 
using libc file streams, which access file contents via 
mmap, resulting in contention over the shared address 
space, even though these memory-mapped files are logi- 
cally private to each thread in pedsort. We avoided this 
problem by modifying pedsort to use one process per 
core for concurrency, eliminating the mmap contention by 
eliminating the shared address space. This modification 
involved changing about 10 lines of code in pedsort. The 
performance of this version on the stock kernel is shown 
as “Stock + Procs” in Figure 10. Even on a single core,
the multi-process version outperforms the threaded version
because any use of threads forces glibc to use slower,
thread-safe variants of various library functions. 

With a small number of cores, the performance of the 
process version depends on how many cores share the per- 
socket L3 caches. Figure 10’s “Stock + Procs” line shows 
performance when the active cores are spread over few 
sockets, while the “Stock + Procs RR” shows performance 
when the active cores are spread evenly over sockets. As 
corroborated by hardware performance counters, the latter 
scheme provides higher performance because each new 
socket provides access to more total L3 cache space. 

Using processes, system time remains small, so the ker- 
nel is not a limiting factor. Rather, as the number of cores 
increases, pedsort spends more time in the glibc sorting 
function msort_with_tmp, which causes the decreasing 
throughput and rising user time in Figure 10. As the num- 
ber of cores increases and the total working set size per 
socket grows, msort_with_tmp experiences higher L3 
cache miss rates. However, despite its memory demands, 
msort_with_tmp never reaches the DRAM bandwidth 
limit. Thus, pedsort is bottlenecked by cache capacity. 


5.8 Metis 


We measured Metis performance by building an inverted 
index from a 2 Gbyte in-memory file. As for Psearchy, 
we spread the active cores across sockets and thus have 
access to the machine’s full L3 cache space at 8 cores. 

The “Stock + 4 KB pages” line in Figure 11 shows 
Metis’ original performance. As the number of cores 
increases, the per-core performance of Metis decreases. 
Metis allocates memory with mmap, which adds the new 
memory to a region list but defers modifying page ta- 
bles. When a fault occurs on a new mapping, the kernel 
locks the entire region list with a read lock. When many 
concurrent faults occur on different cores, the lock itself 
becomes a bottleneck, because acquiring it even in read 
mode involves modifying shared lock state. 

We avoided this problem by mapping memory with 
2 Mbyte super-pages, rather than 4 Kbyte pages, using 
Linux’s hugetlbfs. This results in many fewer page 
faults and less contention on the region list lock. We 
also used finer-grained locking in place of a global mutex 
that serialized super-page faults. The “PK + 2MB pages”
line in Figure 11 shows that use of super-pages increases 
performance and significantly reduces system time. 
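
The scale of the reduction in page faults follows from simple arithmetic. The sketch below assumes a hypothetical 2 GiB region in which every page is touched once and each fault maps exactly one page; the region size and function name are illustrative and are not figures reported for Metis.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

def first_touch_faults(region_bytes: int, page_bytes: int) -> int:
    """Faults needed to touch every page once, one fault per page (rounded up)."""
    return -(-region_bytes // page_bytes)

region = 2 * GIB  # hypothetical region size
small_page_faults = first_touch_faults(region, 4 * KIB)
super_page_faults = first_touch_faults(region, 2 * MIB)
print(small_page_faults, super_page_faults, small_page_faults // super_page_faults)
```

For a 2 GiB region this gives 524,288 faults with 4 Kbyte pages versus 1,024 with 2 Mbyte super-pages, a 512-fold reduction in the number of acquisitions of the region list lock.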




With super-pages, the time spent in the kernel becomes 
negligible and Metis’ scalability is limited primarily by 
the DRAM bandwidth required by the reduce phase. This 
phase is particularly memory-intensive and, at 48 cores, 
accesses DRAM at 50.0 Gbyte/second, just shy of the 
maximum achievable throughput of 51.5 Gbyte/second 
measured by our microbenchmarks.

Figure 11: Metis throughput and runtime breakdown.

Figure 12: Summary of the current bottlenecks in MOSBENCH, attributed
either to hardware (HW) or application structure (App).


5.9 Evaluation summary 


Figure 3 summarized the significant scalability improve- 
ments resulting from our changes. Figure 12 summarizes 
the bottlenecks that limit further scalability of MOSBENCH 
applications. In each case, the application is bottle- 
necked by either shared hardware resources or application- 
internal scalability limits. None are limited by Linux- 
induced bottlenecks. 


6 DISCUSSION 


The results from the previous section show that the MOS- 
BENCH applications can scale well to 48 cores, with mod- 
est changes to the applications and to the Linux kernel. 
Different applications or more cores are certain to reveal 
more bottlenecks, just as we encountered bottlenecks at 
48 cores that were not important at 24 cores. For exam- 
ple, the costs of thread and process creation seem likely 
to grow with more cores in the case where parent and 
child are on different cores. Given our experience scaling 
Linux to 48 cores, we speculate that fixing bottlenecks 
in the kernel as the number of cores increases will also 
require relatively modest changes to the application or 
to the Linux kernel. Perhaps a more difficult problem is 
addressing bottlenecks in applications, or ones where ap- 
plication performance is not bottlenecked by CPU cycles, 
but by some other hardware resource, such as DRAM 
bandwidth. 

Section 5 focused on scalability as a way to increase
performance by exploiting more hardware, but it is usually
also possible to increase performance by exploiting
a fixed amount of hardware more efficiently. Techniques
that a number of recent multicore research operating sys- 
tems have introduced (such as address ranges, dedicating 
cores to functions, shared memory for inter-core message 
passing, assigning data structures carefully to on-chip 
caches, etc. [11, 15, 53]) could apply equally well to 
Linux, improving its absolute performance and benefiting 
certain applications. In future work, we would like to 
explore such techniques in Linux. 

One benefit of using Linux for multicore research is that 
it comes with many applications and has a large developer 
community that is continuously improving it. However, 
there are downsides too. For example, if future processors 
don’t provide high-performance cache coherence, Linux’s 
shared-memory-intensive design may be an impediment 
to performance. 


7 CONCLUSION 


This paper analyzes the scaling behavior of a traditional 
operating system (Linux 2.6.35-rc5) on a 48-core computer
with a set of applications that are designed for parallel
execution and use kernel services. We find that we
can remove most kernel bottlenecks that the applications 
stress by modifying the applications or kernel slightly. 
Except for sloppy counters, most of our changes are ap- 
plications of standard parallel programming techniques. 
Although our study has a number of limitations (e.g., real 
application deployments may be bottlenecked by I/O), the 
results suggest that traditional kernel designs may be com- 
patible with achieving scalability on multicore comput- 
ers. The MOSBENCH applications are publicly available 
at http://pdos.csail.mit.edu/mosbench/, so that
future work can investigate this hypothesis further. 


ACKNOWLEDGMENTS 


We thank the anonymous reviewers and our shepherd, 
Brad Chen, for their feedback. This work was partially 
supported by Quanta Computer and NSF through award 
numbers 0834415 and 0915164. Silas Boyd-Wickizer is 
partially supported by a Microsoft Research Fellowship. 
Yandong Mao is partially supported by a Jacobs Presi- 
dential Fellowship. This material is based upon work 
supported under a National Science Foundation Graduate 
Research Fellowship. 


REFERENCES 


[1] Apache HTTP Server, May 2010. http://httpd.apache.org/.


[2] Exim, May 2010. http://www.exim.org/.


[3] Memcached, May 2010. http://memcached.org/.

9th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’10) 


[4] PostgreSQL, May 2010. http://www.postgresql.org/.

[5] The search for fast, scalable counters, May 2010. 
http://lwn.net/Articles/170003/. 


[6] J. Aas. Understanding the Linux 2.6.8.1
CPU scheduler, February 2005. http://
josh.trancesoftware.com/linux/.


[7] AMD, Inc. Six-core AMD opteron processor 
features. http://www.amd.com/us/products/ 
server/processors/six-core-opteron/ 
Pages/six-core-opteron-key-architectural 
-features.aspx. 


[8] T. E. Anderson, B. N. Bershad, E. D. Lazowska, 
and H. M. Levy. Scheduler activations: Effective 
kernel support for the user-level management of 
parallelism. In Proc. of the 13th SOSP, pages 95-109,
1991.


[9] J. Appavoo, D. D. Silva, O. Krieger, M. Auslander, 
M. Ostrowski, B. Rosenburg, A. Waterland, R. W. 
Wisniewski, J. Xenidis, M. Stumm, and L. Soares. 
Experience distributing objects in an SMMP OS. 
ACM Trans. Comput. Syst., 25(3):6, 2007. 


[10] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, 
K. Keutzer, J. Kubiatowicz, N. Morgan, D. Pat- 
terson, K. Sen, J. Wawrzynek, D. Wessel, and 
K. Yelick. A view of the parallel computing land- 
scape. Commun. ACM, 52(10):56-67, 2009. 


[11] A. Baumann, P. Barham, P.-E. Dagand, T. Harris,
R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and
A. Singhania. The Multikernel: a new OS architec- 
ture for scalable multicore systems. In Proc of the 
22nd SOSP, Big Sky, MT, USA, Oct 2009. 


[12] B. N. Bershad, T. E. Anderson, E. D. Lazowska, 
and H. M. Levy. Lightweight remote procedure call. 
ACM Trans. Comput. Syst., 8(1):37-55, 1990. 


[13] D. L. Black. Scheduling support for concurrency 
and parallelism in the Mach operating system. Com- 
puter, 23(5):35-43, 1990. 


[14] W. Bolosky, R. Fitzgerald, and M. Scott. Simple but 
effective techniques for NUMA memory manage- 
ment. In Proc. of the 12th SOSP, pages 19-31, New 
York, NY, USA, 1989. ACM. 


[15] S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, 
F. Kaashoek, R. Morris, A. Pesterev, L. Stein, 
M. Wu, Y. D. Y. Zhang, and Z. Zhang. Corey: An 
operating system for many cores. In Proc. of the 8th 
OSDI, December 2008. 


USENIX Association 


[16] R. Bryant, J. Hawkes, J. Steiner, J. Barnes, and
J. Higdon. Scaling linux to the extreme. In Proceedings
of the Linux Symposium 2004, pages 133-148,
Ottawa, Ontario, June 2004.


[17] B. Cantrill and J. Bonwick. Real-world concurrency.
Commun. ACM, 51(11):34-39, 2008.


[18] J. Corbet. The lockless page cache, May 2010.
http://lwn.net/Articles/291826/.


[19] A. L. Cox and R. J. Fowler. The implementation of
a coherent memory abstraction on a NUMA multiprocessor:
Experiences with platinum. In Proc. of
the 12th SOSP, pages 32-44, 1989.


[20] J. Dean and S. Ghemawat. MapReduce: simplified
data processing on large clusters. Commun. ACM,
51(1):107-113, 2008.


[21] M. Dobrescu, N. Egi, K. Argyraki, B.-G. Chun,
K. Fall, G. Iannaccone, A. Knies, M. Manesh, and
S. Ratnasamy. RouteBricks: Exploiting parallelism
to scale software routers. In Proc of the 22nd SOSP,
Big Sky, MT, USA, Oct 2009.


[22] F. Ellen, Y. Lev, V. Luchangco, and M. Moir. SNZI:
Scalable nonzero indicators. In PODC 2007, Portland,
Oregon, USA, Aug. 2007.


[23] GNU Make, May 2010. http://www.gnu.org/
software/make/.


[24] C. Gough, S. Siddha, and K. Chen. Kernel
scalability - expanding the horizon beyond fine
grain locks. In Proceedings of the Linux Symposium
2007, pages 153-165, Ottawa, Ontario, June
2007.


[25] T. Herbert. rfs: receive flow steering, September
2010. http://lwn.net/Articles/381955/.


[26] T. Herbert. rps: receive packet steering, September
2010. http://lwn.net/Articles/361440/.


[27] M. Herlihy. Wait-free synchronization. ACM Trans.
Program. Lang. Syst., 13(1):124-149, 1991.


[28] J. Jackson. Multicore requires OS rework
Windows architect advises. PCWorld magazine,
2010. http://www.pcworld.com/
businesscenter/article/191914/
multicore_requires_os_rework_windows
_architect_advises.html.


[29] Z. Jia, Z. Liang, and Y. Dai. Scalability evaluation
and optimization of multi-core SIP proxy server. In
Proc. of the 37th ICPP, pages 43-50, 2008.


[30] A. R. Karlin, K. Li, M. S. Manasse, and S. S. Owicki.
Empirical studies of competitive spinning for a
shared-memory multiprocessor. In Proc. of the 13th
SOSP, pages 41-55, 1991.


[31] A. Kleen. An NUMA API for Linux, August
2004. http://www.firstfloor.org/~andi/
numa.html.


[32] A. Kleen. Linux multi-core scalability. In Proceedings
of Linux Kongress, October 2009.


[33] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni,
K. Gharachorloo, J. Chapin, D. Nakahira,
J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum,
and J. Hennessy. The Stanford FLASH multiprocessor.
In Proc. of the 21st ISCA, pages 302-313,
1994.


[34] R. P. LaRowe, Jr., C. S. Ellis, and L. S. Kaplan.
The robustness of NUMA memory management. In
Proc. of the 13th SOSP, pages 137-151, 1991.


[35] J. Li, B. T. Loo, J. M. Hellerstein, M. F. Kaashoek,
D. Karger, and R. Morris. On the feasibility of
peer-to-peer web indexing and search. In Proc. of the 2nd
IPTPS, Berkeley, CA, February 2003.


[36] Linux 2.6.35-rc5 source, July
2010. Documentation/scheduler/
sched-design-CFS.txt.


[37] Linux kernel mailing list, May 2010. http://
kerneltrap.org/node/8059.


[38] Y. Mao, R. Morris, and F. Kaashoek. Optimizing
MapReduce for multicore architectures. Technical
Report MIT-CSAIL-TR-2010-020, MIT, 2010.


[39] P. E. McKenney, D. Sarma, A. Arcangeli, A. Kleen,
O. Krieger, and R. Russell. Read-copy update. In
Proceedings of the Linux Symposium 2002, pages
338-367, Ottawa, Ontario, June 2002.


[40] P. E. McKenney, D. Sarma, and M. Soni. Scaling
dcache with RCU, Jan. 2004. http://
www.linuxjournal.com/article/7124.


[41] J. M. Mellor-Crummey and M. L. Scott. Algorithms
for scalable synchronization on shared-memory
multiprocessors. ACM Trans. Comput. Syst., 9(1):21-65,
1991.


[42] E. M. Nahum, D. J. Yates, J. F. Kurose, and
D. Towsley. Performance issues in parallelized network
protocols. In Proc. of the 1st OSDI, page 10,
Berkeley, CA, USA, 1994. USENIX Association.


[43] D. Patterson. The parallel revolution has started:
are you part of the solution or the problem? In
USENIX ATEC, 2008. www.usenix.org/event/
usenix08/tech/slides/patterson.pdf.


[44] A. Pesterev, N. Zeldovich, and R. T. Morris. Locating
cache performance bottlenecks using data
profiling. In Proceedings of the ACM EuroSys Conference
(EuroSys 2010), Paris, France, April 2010.


[45] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski,
and C. Kozyrakis. Evaluating MapReduce for
multi-core and multiprocessor system. In Proceedings
of HPCA. IEEE Computer Society, 2007.


[46] C. Schimmel. UNIX systems for modern architectures:
symmetric multiprocessing and caching for
kernel programmers. Addison-Wesley, 1994.


[47] M. D. Schroeder and M. Burrows. Performance
of Firefly RPC. In Proc. of the 12th SOSP, pages
83-90, 1989.


[48] J. Stribling, J. Li, I. G. Councill, M. F. Kaashoek,
and R. Morris. Overcite: A distributed, cooperative
citeseer. In Proc. of the 3rd NSDI, San Jose, CA,
May 2006.


[49] J. H. Tseng, H. Yu, S. Nagar, N. Dubey, H. Franke,
P. Pattnaik, H. Inoue, and T. Nakatani. Performance
studies of commercial workloads on a multi-core
system. IEEE Workload Characterization Symposium,
pages 57-65, 2007.


[50] R. Vaswani and J. Zahorjan. The implications of
cache affinity on processor scheduling for multiprogrammed,
shared memory multiprocessors. In Proc.
of the 13th SOSP, pages 26-40, 1991.


[51] B. Veal and A. Foong. Performance scalability of
a multi-core web server. In Proceedings of the 3rd
ACM/IEEE Symposium on Architecture for Networking
and Communications Systems, pages 57-66,
New York, NY, USA, 2007.


[52] B. Verghese, S. Devine, A. Gupta, and M. Rosenblum.
Operating system support for improving data
locality on CC-NUMA compute servers. In Proc.
of the 7th ASPLOS, pages 279-289, New York, NY,
USA, 1996. ACM.


[53] D. Wentzlaff and A. Agarwal. Factored operating
systems (fos): the case for a scalable operating
system for multicores. SIGOPS Oper. Syst. Rev.,
43(2):76-85, 2009.


[54] C. Yan, Y. Chen, and S. Yuanchun. Parallel scalability
comparison of commodity operating systems on
large scale multi-cores. In Proceedings of the workshop
on the interaction between Operating Systems
and Computer Architecture (WIOSCA 2009).


[55] C. Yan, Y. Chen, and S. Yuanchun. OSMark: A
benchmark suite for understanding parallel scalability
of operating systems on large scale multi-cores.
In 2009 2nd International Conference on Computer
Science and Information Technology, pages 313-317,
2009.


[56] C. Yan, Y. Chen, and S. Yuanchun. Scaling OLTP
applications on commodity multi-core platforms.
In 2010 IEEE International Symposium on Performance
Analysis of Systems & Software (ISPASS),
pages 134-143, 2010.


[57] M. Young, A. Tevanian, R. F. Rashid, D. B. Golub,
J. L. Eppinger, J. Chew, W. J. Bolosky, D. L. Black,
and R. V. Baron. The duality of memory and communication
in the implementation of a multiprocessor
operating system. In Proc. of the 11th SOSP, pages
63-76, 1987.




Trust and Protection in the Illinois Browser Operating System 


Shuo Tang, Haohui Mai, Samuel T. King 
University of Illinois at Urbana-Champaign 


Abstract 


Current web browsers are complex, have enormous 
trusted computing bases, and provide attackers with easy 
access to modern computer systems. In this paper we in- 
troduce the Illinois Browser Operating System (IBOS), 
a new operating system and a new browser that re- 
duces the trusted computing base for web browsers. In 
our architecture we expose browser-level abstractions 
at the lowest software layer, enabling us to remove al- 
most all traditional OS components and services from 
our trusted computing base by mapping browser abstrac- 
tions to hardware abstractions directly. We show that this 
architecture is flexible enough to enable new browser se- 
curity policies, can still support traditional applications, 
and adds little overhead to the overall browsing experi- 
ence. 


1 Introduction 


Web-based applications (web apps), browsers, and op- 
erating systems have become popular targets for attack- 
ers of computer systems. Vulnerabilities in web apps 
are widespread and increasing. For example, cross-site 
scripting (XSS), which is effectively a form of script in- 
jection into a web app, recently overtook the ubiquitous 
buffer overflow as the most common security vulnerabil- 
ity [50]. Vulnerabilities in web browsers are less com- 
mon than web app vulnerabilities, but still occur often. 
For example, in 2009 Internet Explorer, Chrome, Safari, 
and Firefox had 349 new security vulnerabilities [4], and 
attackers exploit browsers commonly [53, 37, 42, 41, 4]. 
Vulnerabilities in libraries, system services, and oper- 
ating systems are less common than vulnerabilities in 
browsers, but are still problematic for modern systems. 
For example, glibc, GTK+, X, and Linux had 114 new 
security vulnerabilities in 2009 [1], and in 2009 the most 
commonly attacked vulnerability was a remote code ex- 
ecution bug in the Windows kernel [4]. 


However, not all attacks on web apps, browsers, and 
operating systems are equally virulent. At the top of the 
computer stack, attacks on web apps, such as XSS, oper- 
ate within current browser security policies that contain 
the damage to the vulnerable web app. Moving down 
the computer stack, attacks on browsers can cause more 
damage because a successful attack gives the attacker ac- 
cess to browser data for all web apps and access to other 
resources on the system. At the lowest layers of the 
computer stack, attacks on libraries, shared system ser- 
vices, and operating systems are the most serious attacks 
because attackers can access arbitrary states and events, 
giving them complete control of the system. 


Overall, these trends indicate that vulnerabilities 
higher in the computer stack are more common, but vul- 
nerabilities lower in the computer stack provide attack- 
ers with more control and are more damaging. In this 
paper we focus on preventing and containing attacks on 
browsers, libraries, system services, and operating sys- 
tems — the lower layers of the computer stack. 


Current research efforts into more secure web 
browsers help improve the security of browsers, but 
remain susceptible to attacks on lower layers of the 
computer stack. The OP web browser [26], Gazelle 
[52], Chrome [11], and ChromeOS [25] propose new 
browser architectures for separating the functionality 
of the browser from security mechanisms and policies. 
However, these more secure web browsers are all built 
on top of commodity operating systems and include 
complex user-mode libraries and shared system services 
within their trusted computing base (TCB). Even kernel 
designs with strong isolation between OS components 
(e.g., microkernels [24, 27, 28] and information-flow ker- 
nels [18, 57, 33]) still have OS services that are shared 
by all applications, which attackers can compromise and 
still cause damage. Here are a few ways that an attacker 
can still cause damage to more secure web browsers built 
on top of traditional OSes:


• A compromised Ethernet driver can send sensitive
HTTP data (e.g., passwords or login cookies) to any
remote host or change the HTTP response data before
routing it to the network stack.


• A compromised storage module can modify or steal
any browser-related persistent data.


• A compromised network stack can tamper with any
network connection or send sensitive HTTP data to
an attacker.


• A compromised window manager can draw any
content on top of a web page to deploy visual attacks,
such as phishing.


In this paper we describe IBOS, an operating sys- 
tem and a browser co-designed to reduce drastically the 
TCB for web browsers and to simplify browser-based 
systems. Our key insight is that our lowest-layer soft- 
ware can expose browser-level abstractions, rather than 
general-purpose OS abstractions, to provide vastly im- 
proved security properties for the browser without affect- 
ing the TCB for traditional applications. Some examples 
of browser abstractions are cookies for persistent storage, 
hypertext transfer protocol (HTTP) connections for net- 
work I/O, and tabs for displaying web pages. To support 
traditional applications, we build UNIX-like abstractions 
on top of our browser abstractions. 

IBOS improves on past approaches by removing typi- 
cally shared OS components and system services from 
our browser’s TCB, including device drivers, network 
protocol implementations, the storage stack, and win- 
dow management software. All of these components run 
above a trusted reference monitor [9], which enforces our 
security policies. These components operate on browser- 
level abstractions, allowing us to map browser security 
policies down to the lowest-level hardware directly and 
to remove drivers and system services from our TCB. 

This architecture is a stark contrast to current systems 
where all applications layer application-specific abstrac- 
tions on top of general-purpose OS abstractions, inherit- 
ing the cruft needed to implement and access these gen- 
eral OS abstractions. By exposing application-specific 
abstractions at the OS layer, we can cut through complex 
software layers for one particular application without af- 
fecting traditional applications adversely, which still run 
on top of general OS abstractions and still inherit cruft. 
We choose to illustrate this principle using a web browser 
because browsers are used widely and have been prone 
to security failures recently. Our goal is to build a system
where a user can visit a trusted web site safely, even
if one or more of the components on the system have been
compromised. 

Our contributions are:


• IBOS is the first system to improve browser and OS
security by making browser-level abstractions first-class
OS abstractions, providing a clean separation
between browser functionality and browser security.


• We show that having low-layer software expose
browser abstractions enables us to remove almost
all traditional OS components from our TCB, including
device drivers and shared OS services, allowing
IBOS to withstand a wide range of attacks.


• We demonstrate that IBOS can still support traditional
applications that interact with the browser and
shared OS services without compromising the security
of our system.


2 The IBOS architecture 


This paper presents the design and implementation of 
the IBOS operating system and browser that reduce the 
TCB for browsing drastically. Our primary goals are to 
enforce today’s browser security policies with a small 
TCB, without restricting functionality, and without slow- 
ing down performance. To withstand attacks, IBOS must 
ensure any compromised component (1) cannot tamper 
with data it should not have access to, (2) cannot leak 
sensitive information to third parties, and (3) cannot ac- 
cess components operating on behalf of different web 
sites. 

In this section we discuss the design principles that 
guide our design and the overall system architecture. In 
Section 4 we discuss the security policies and mecha- 
nisms we use. 


2.1 Design principles 


We embrace microkernel [27], Exokernel [19], and 
safety kernel design principles in our overall architec- 
ture. By combining these principles with our insight 
about exposing browser abstractions at the lowest soft- 
ware layer we hope to converge on a more trustworthy 
browser design. Five key principles guide our design: 


1. Make security decisions at the lowest layer of soft- 
ware. By pushing our security decisions to the low- 
est layers we hope to avoid including the millions 
of lines of library and OS code in our TCB. 


2. Use controlled sharing between web apps and tra- 
ditional apps. Sharing data between web apps and 
traditional apps is a fundamental functionality of 
today’s practical systems and should be supported. 
However, this sharing should be facilitated through 
a narrow interface to prevent misuse.

Figure 1: Overall IBOS architecture. Our system contains
user-mode drivers, browser API managers, web
page instances, and traditional processes. To manage the
interactions between these components, we use a reference
monitor that runs within our IBOS kernel. Shaded
regions make up the TCB.





3. Maintain compatibility with current browser secu- 
rity policies. Our primary goal is to improve the 
enforcement of current browser policies without 
changing current web-based applications. 


4. Expose enough browser states and events to enable 
new browser security policies. In addition to en- 
forcing current browser policies, we would like our 
architecture to adapt easily to future browser poli- 
cies. 


5. Avoid rule-based OS sandboxing for browser com- 
ponents. Fundamentally, rule-based OS sandbox- 
ing is about restricting unused or overly permis- 
sive interfaces exposed by today’s operating sys- 
tems. However, sandboxing systems can be com- 
plex (the Ubuntu 10.04 SELinux reference policy 
uses over 104K lines of policy code) and difficult to 
implement correctly [23, 51]. If our architecture re- 
quires OS sandboxing for browser components then 
we should rethink the architecture. 


2.2 Overall architecture 


Figure 1 shows the overall IBOS architecture. The IBOS 
architecture uses a basic microkernel approach with a 
thin kernel for managing hardware and facilitating mes- 
sage passing between processes. The system includes 
user-mode device drivers for interacting directly with 
hardware devices, such as network interface cards (NIC), 
and browser API managers for accessing the drivers and 


implementing browser abstractions. The key browser 
abstractions that the browser API managers implement 
are HTTP requests, cookies and local storage for stor- 
ing persistent data, and tabs for displaying user-interface 
(UI) content. Web apps use these abstractions directly 
to implement browser functionality, and traditional ap- 
plications (traditional apps) use a UNIX layer to access 
UNIX-like abstractions on top of these browser abstrac- 
tions. 


2.2.1 The IBOS kernel 


Our IBOS kernel is the software TCB for the browser and 
includes resource management functionality and a refer- 
ence monitor for security enforcement. The IBOS kernel 
also handles many traditional OS tasks such as manag- 
ing global resources, creating new processes, and man- 
aging memory for applications. To facilitate message 
passing, the IBOS kernel includes the L4Ka::Pistachio 
[8] message passing implementation and MMU manage- 
ment functions. All messages pass through our reference 
monitor and are subjected to our overall system security 
policy. Section 4 describes the policies that the IBOS 
kernel enforces and the mechanisms it uses to implement 
these policies. 


2.2.2 Network, storage, and UI managers 


The IBOS network subsystem handles HTTP requests 
and socket calls for applications. To handle HTTP re- 
quests, network processes check a local cache to see if 
the request can be serviced via the cache, fetch any cook- 
ies needed for the request, format the HTTP data into a 
TCP stream, and transform that TCP stream into a series 
of Ethernet frames that are sent to the NIC driver. Socket 
network processes export a basic socket API and simply 
transform TCP streams to Ethernet frames for transmis- 
sion across the network. Only traditional apps can access 
our socket network processes. The IBOS kernel manages 
global states, like port allocation. 

The IBOS storage manager maintains persistent storage
for key-value data pairs. The browser uses the storage
manager to store HTTP cookies and HTML5 local
storage objects, and the basic object store includes optional
parameters, such as Path and Max-Age, to expose
cookie properties to the reference monitor. The
storage manager uses several different namespaces to 
isolate objects from each other. Web apps and net- 
work processes share a namespace based on the origin 
(the <protocol, domain name, port> tuple of 
a uniform resource locator) that they originate from, 
and web apps and traditional apps share a “localhost” 
namespace, which is separate from the HTTP namespace.
All other drivers and managers have their own private
namespaces to access persistent data.

The IBOS UI manager plays the role of the window 
manager for the system. However, rather than implement 
the browser UI components on top of the traditional win- 
dow motif, we opted for a tabbed browser motif. Basic 
browser UI widgets, called the browser chrome, are dis- 
played at the top of the screen. IBOS displays web pages 
in tabs and the user can have any number of tabs open for 
web apps. There is a tab for basic browser configuration 
and administration, and a tab that is shared by traditional 
apps. If traditional apps wish to implement the window 
motif, they can do so within the tab. The main advan- 
tage of our browser-based motif is that it enables IBOS 
to bypass the extra layers of indirection traditional win- 
dow managers put between applications and the under- 
lying graphics hardware, exposing browser UI elements 
and events directly to the IBOS kernel. We discuss the 
security implications of our design decision in more de- 
tail in Section 4.8. 


2.2.3 Web apps, traditional apps, and plugins


The IBOS system supports two different types of pro- 
cesses: web page instances and traditional processes. A 
web page instance is a process that is created for each in- 
dividual web page a user visits. Each time the user clicks 
on a link or types a uniform resource locator (URL) into 
the address bar, the IBOS kernel creates a new web page 
instance. Web page instances are responsible for issuing 
HTTP requests, parsing HTML, executing JavaScript, 
and rendering web content to a tab. Traditional processes 
can execute arbitrary instructions, and the key difference 
between a web page instance and a traditional process
is that the IBOS kernel gives them different security labels,
which the kernel uses for access control decisions.
Web page instances are labeled with the origin of the 
HTTP request used to initiate the new web page, and tra- 
ditional processes are labeled as being from “localhost.” 
These two types of processes interact via the storage subsystem
since both types of processes can access “localhost” data. 

In general, plugins are external applications that 
browsers use to render non-HTML content. One com- 
mon example of a plugin is the Flash player that enables 
browsers to play Flash content. In IBOS, plugins run as 
traditional processes, except that they are launched by 
the browser and the system gives them access to browser 
states and events through a standard plugin programming 
interface, called the NPAPI [2]. 


3 Current browser policies 


In this section we give a brief introduction to the same- 
origin policy (SOP) for browser security. For a more 
complete discussion of this policy and others, plus experimental
results showing how current browsers implement
them, please see a recent paper by Singh et al. [47].

The primary security policy that all modern browsers 
implement is the SOP. The SOP acts as a non- 
interference policy for the web. Loosely speaking, the 
SOP provides isolation for web pages and states that 
come from different origins — origins are used as labels 
for browser access control policies. If the browser has a 
web page open from uiuc.edu and another from attacker.com,
the SOP should ensure that these two web pages are
isolated from each other. Unfortunately, Chrome, IE8, 
Safari, and Firefox all enforce the SOP using a number 
of checks scattered throughout the millions of lines of 
browser code and current browsers have had trouble im- 
plementing the SOP correctly [14]. 

In a browser, a frame is a container that encapsulates 
an HTML document and any material included in that
HTML document. Web pages are frames, and web de- 
velopers can embed additional frames within web pages 
— these frames are called iframes. Developers can 
include iframes from the same origin as the hosting 
frame, or from a different origin. Each frame is labeled 
with the origin of the main HTML document used to populate
the frame, meaning that a cross-origin iframe has
a different label than the hosting web page. 

In general, HTML documents include references to
network objects that the browser will download and dis- 
play to form the web page. These network objects can 
be images, JavaScript, and CSS. Browsers can download 
these objects from any domain and the browser labels 
them with the origin of the hosting frame. For exam- 
ple, if a page from uiuc.edu includes a script from 
foo.com, that script runs with full uiuc.edu per- 
missions and can access any of the states in that web 
page. Browsers can also download HTML documents
and issue XMLHttpRequests (used for Ajax), but the SOP
dictates that these objects must come from the same ori- 
gin as the hosting frame. 
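
The labeling rules above can be illustrated with a minimal sketch. It assumes that an origin is compared as a normalized (protocol, domain name, port) tuple and that default ports are filled in for http and https. The helper names origin and may_load, and the kind strings, are hypothetical and are not part of any browser or of IBOS.

```python
from urllib.parse import urlsplit

# Default ports are filled in so that http://uiuc.edu and
# http://uiuc.edu:80 compare equal (an assumption of this sketch).
DEFAULT_PORTS = {"http": 80, "https": 443}

# Object kinds that a frame may load from any origin; they take
# the label of the hosting frame.
CROSS_ORIGIN_OK = {"image", "script", "css"}


def origin(url):
    """Return the (protocol, domain name, port) tuple for a URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)


def may_load(frame_url, object_url, kind):
    """Apply the SOP rule described above to one network object."""
    if kind in CROSS_ORIGIN_OK:
        return True
    # HTML documents and XMLHttpRequest replies must be same-origin.
    return origin(frame_url) == origin(object_url)
```

Under these assumptions, a script from foo.com is allowed into a uiuc.edu frame, while an HTML document from attacker.com is not.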


4 IBOS security policies and mechanisms 


Our primary goal is to enforce browser security policies 
from within our IBOS kernel. This section describes the 
mechanisms that the IBOS kernel uses to enforce the 
SOP. We also discuss policies and mechanisms for en- 
forcing UI interactions, and we describe a custom policy 
engine that lets web sites further restrict current policies. 


4.1 Threat model and assumptions 


Our primary goal is to ensure that the IBOS kernel up- 
holds our security policies even if one or more of the sub- 
systems have been compromised. In our threat model, 
Figure 2: This figure enlarges the right half of Figure 1
and shows how our IBOS subsystems interact when a 
web page instance from uiuc.edu issues a network 
request to foo.com. Subsystems are shown in boxes 
and solid and dotted arrows represent IBOS messages for 
outgoing and incoming data respectively. The reference 
monitor (which is not shown here) checks all these mes- 
sages to enforce security properties. 





we assume that an attacker controls a web site and can 
serve arbitrary data to our browser, or that the system 
contains a malicious traditional app. We also assume that 
this malicious data or traditional app can compromise 
one or more of the components in our system. These 
susceptible components include all drivers, browser API 
managers, web page instances, and traditional processes. 
Once the attacker takes control of these components, we 
assume that he or she can execute arbitrary instructions 
as a result of the attack. We focus on maintaining the in- 
tegrity and confidentiality of the data in our browser. In 
other words, we would like the user to be able to open a 
web page on a trusted web server, and interact with this 
web page securely, even if everything on the client sys- 
tem outside of our TCB has been compromised. Avail- 
ability is an important, but separate, aspect of browser 
security that we do not address in this paper. 

In our system we trust the layers upon which we built 
IBOS. These layers include the IBOS kernel and the un- 
derlying hardware. Like all other browsers, IBOS pred- 
icates security decisions based on domain names, so we 
trust domain name servers to map domain names to IP 
addresses correctly. Compromising any of these trusted 
layers compromises the security of IBOS. 


4.2 IBOS work flow 


This section describes a web page instance making a net- 
work request to help illustrate the security mechanisms 
that IBOS uses. 


Figure 2 shows the flow of how a web page instance 
fetches data from the network. The user visits a page 
hosted at uiuc.edu and this web page includes an image
from foo.com. To download the image, (1) the web
page instance makes an HTTP request that the IBOS
kernel forwards to an appropriate network process. The
network process forms an HTTP request, which includes
setting up HTTP headers, (2) fetching cookies from the
storage subsystem, (3) requesting a free local TCP port
to transform this request into TCP/IP packets and Ethernet
frames, and (4) sending them to the network manager. The
network manager notifies the Ethernet driver, which (5)
programs the NIC to transmit the packet out to the
network. When the NIC receives a reply for the request, (6)
it notifies the Ethernet driver. The driver subsequently 
(7) notifies the network manager, which (8) forwards the 
packet to the appropriate network process. The network 
process then parses the data and (9) passes the resulting 
HTTP reply and data to the original web page instance. 


4.3 IBOS labels 


To enforce access control decisions, the IBOS kernel la- 
bels web page instances, traditional processes, and net- 
work processes. IBOS labels specify the resources that 
a process can access or messages it can receive. Each 
web page instance has one label, which is the origin of 
the main HTML document. Each traditional process is 
labeled as being from “localhost” when it is created.
Each network process has an origin label for the network 
resources it handles and has an origin label for the web 
page instances that are allowed to access it. IBOS la- 
bels the processes upon creation, and keeps the labels 
unchanged throughout the processes’ life-cycle. 

An important point is that the IBOS kernel infers the 
origin labels for web page instances and network pro- 
cesses automatically by extracting related information 
from the messages passed among them. By inferring la- 
bels rather than relying on processes to label themselves, 
the IBOS kernel ensures that it has the correct label in- 
formation, even if a process is compromised. 

The newUrl and fetchUrl IBOS system calls are the 
two requests that cause the kernel to label processes. The 
newUrl system call is used by web page instances and the
UI manager to navigate the browser to a new URL.
The newUrl system call consists of two arguments: a 
URL and a byte array for HTTP POST data. When the 
IBOS kernel receives a newUrl request it will create a 
new web page instance and set the label for this web page 
instance by parsing the origin out of the URL argument 
of the newUrl request. When servicing newUrl requests, 
the IBOS kernel will reuse old web page instances (to 
reduce process startup times), but only when the origin 
labels match for the old web page instance and the URL 




argument. 

Web page instances use the fetchUrl system call to is- 
sue HTTP and HTTPS requests to fetch network objects, 
such as images. The fetchUrl system call has two ar- 
guments: a URL and HTTP header information. When 
a web page instance issues a fetchUrl system call, the 
IBOS kernel uses the origin of the web page instance 
(set by the original newUrl call) and the origin of the 
fetchUrl URL argument to find a network process with 
these same labels, or creates a new network process
and labels it accordingly if an existing network process 
cannot be found. 
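
The labeling behavior of newUrl and fetchUrl can be sketched as follows. This is only an illustration under stated assumptions: the class name LabelingKernel, the method names new_url and fetch_url, and the integer process IDs are hypothetical, origins are modeled as (protocol, domain name, port) tuples, and the reuse of old web page instances is left out.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (protocol, domain name, port) tuple for a URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return (scheme, (parts.hostname or "").lower(), port)


class LabelingKernel:
    """Toy model of how the kernel infers labels from system calls."""

    def __init__(self):
        self.page_labels = {}  # web page instance id -> origin label
        self.net_procs = {}    # (page origin, resource origin) -> process id
        self._next_pid = 1

    def _spawn(self):
        pid = self._next_pid
        self._next_pid += 1
        return pid

    def new_url(self, url, post_data=b""):
        # The label comes from the URL argument, never from the caller.
        pid = self._spawn()
        self.page_labels[pid] = origin(url)
        return pid

    def fetch_url(self, page_pid, url, headers=None):
        # Both labels are derived by the kernel: one from the earlier
        # new_url call, one from the URL argument of this request.
        key = (self.page_labels[page_pid], origin(url))
        if key not in self.net_procs:
            self.net_procs[key] = self._spawn()
        return self.net_procs[key]
```

In this model, two fetches from the same page to foo.com share one network process, while a page from another origin fetching from foo.com gets a separate one.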

More details about how we use these labels for access 
control decisions are described in the remainder of this 
section. 


4.4 Security invariants 


For all of our subsystems, we use security invariants that 
are assertions on all interactions between subsystems that 
check basic security properties. The key to our security 
invariants is that we can extract security relevant infor- 
mation from messages automatically, and provide high 
assurance that the system maintains the security policy 
without having to understand how each individual sub- 
system is implemented. Using these security invariants, 
we remove from the TCB almost all of the components 
found in modern commodity operating systems, includ- 
ing device drivers. 

The ideal security invariant is complete, implementa- 
tion agnostic, executes quickly, and requires only a small 
amount of code in the IBOS kernel. A complete invariant 
can infer all of the states needed to ensure the high-level 
security policy, and an implementation agnostic invari- 
ant can infer states without relying on the specific imple- 
mentation of individual subsystems. The IBOS kernel 
evaluates invariants in the kernel and inline with mes- 
sages, so security invariants should execute quickly and 
require little code to implement. In our design we strive 
to make the appropriate trade offs among these proper- 
ties to improve security without making the system slow 
or increasing our TCB significantly. The base security 
invariant we have is: 


SI 0: All components can only perform their designated 
functions. 


For example, the UI subsystem can never ask for
cookie data, and the storage manager cannot impersonate
a network process to send synthesized attack HTTP data 
to a web page instance. 


4.5 Driver invariants 


The two driver invariants the IBOS kernel enforces are: 
SI 1: Drivers cannot access DMA buffers directly.
SI 2: Devices can only access validated DMA buffers. 


In our approach, we use a split driver architecture 
where we separate the management of device control reg- 
isters from the use of device buffers (SI 1). For example, 
our Ethernet driver never has access to transmit or re- 
ceive buffers directly. Instead, it knows the physical ad- 
dresses where the IBOS kernel stores these buffers, and 
it programs the NIC to use them. By separating these 
two functions we can interpose on the communications 
between them to ensure that IBOS upholds browser secu- 
rity policies, even if an attacker completely compromises 
a shared driver. 

Using this split architecture, processes fill in device- 
specific buffers for DMA transfers, and the IBOS ker- 
nel infers when drivers initiate DMA transfers to ensure 
that the driver instructs the device to use a verified DMA 
buffer (SI 2). Fortunately, DMA buffers tend to use 
well-defined interfaces, like Ethernet frames for Ether- 
net drivers, so the IBOS kernel can readily glean security 
relevant information from these DMA buffers before the 
device accesses them. Unfortunately, the interface be- 
tween drivers and devices is device-specific, so the IBOS 
kernel must have a small state machine for each device 
to properly infer DMA transfers. However, we found this 
state machine to be quite small for the devices that we use 
in IBOS. 

In IBOS we implement a driver for the e1000 NIC, a 
VESA BIOS Extensions driver for our video card, and 
drivers for the mouse and keyboard. 


4.6 Storage invariants 


The primary invariant we strive to enforce in the storage 
manager is: 


SI 3: All of our key-value pairs maintain confidentiality 
and integrity even if the storage stack itself becomes 
compromised. 


To enforce this invariant, our IBOS kernel encrypts 
all objects before passing them to the storage subsystem. 
To encrypt data, the IBOS kernel maintains separate en- 
cryption keys for all of the namespaces on the IBOS sys- 
tem. These namespaces include separate namespaces for 
HTTP cookies based on the domain of the cookie, sep- 
arate namespaces for web page instances based on the 
origin of the page, separate namespaces for each of our 
subsystems, and a separate namespace for all traditional 
apps. When the IBOS kernel passes a request to the stor- 
age manager it will append the security labels, a copy 
of the key from the key-value pair, and a hash of the 
contents to the payload before encrypting the data and 
passing it to the storage subsystem. When the IBOS kernel
retrieves this data, it can decrypt the data and check
the labels and integrity of the information. By using en- 
cryption, the IBOS kernel does not need to implement 
security invariants for any of our storage drivers, and our 
storage subsystem is free to make data persistent using 
any mechanisms it sees fit, such as the network (like in 
our implementation) or via a disk-based storage system. 
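
The record layout and the checks on retrieval can be sketched as below. The sketch makes several assumptions that are not taken from IBOS: the record is serialized as JSON, the hash is SHA-256, labels are plain strings, and the function names seal and unseal are hypothetical. The encryption and decryption steps with the per-namespace key are deliberately left out, so the blob here is plaintext and the sketch shows only the label, key, and integrity checks.

```python
import hashlib
import json


def seal(label, key, value):
    """Build the record the kernel would encrypt before storing it."""
    record = {
        "label": label,
        "key": key,
        "hash": hashlib.sha256(value).hexdigest(),
        "value": value.hex(),
    }
    # Encryption with the per-namespace key would happen here;
    # it is omitted in this sketch.
    return json.dumps(record, sort_keys=True).encode("utf-8")


def unseal(blob, expected_label, expected_key):
    """Check a record that the untrusted storage subsystem returned."""
    # Decryption would happen here; it is omitted in this sketch.
    record = json.loads(blob.decode("utf-8"))
    value = bytes.fromhex(record["value"])
    if record["label"] != expected_label:
        raise ValueError("label mismatch")
    if record["key"] != expected_key:
        raise ValueError("key mismatch")
    if hashlib.sha256(value).hexdigest() != record["hash"]:
        raise ValueError("integrity check failed")
    return value
```

Storing a copy of the key inside the record lets the kernel notice when a compromised storage stack answers a request for one key with a valid record that belongs to another key.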

Our current implementation does not make any efforts 
to avoid an attacker that deletes objects or replays old 
storage data. For web applications this limitation has 
only a small effect because the cookie standards do not 
require browsers to keep cookies persistently and be- 
cause web applications often limit the lifetime of cookies 
using expiration dates, which are also part of the cookie 
standard. However, if this limitation did become prob- 
lematic, we could apply the principles learned from dis- 
tributed or secure file systems to provide stronger guar- 
antees. 


4.7 Network process invariants 


Our IBOS kernel maintains five main invariants for net- 
work processes: 


SI 4: The kernel must route network requests from web
page instances to the proper network process.

SI 5: The kernel must route Ethernet frames from the
NIC to the proper network processes.

SI 6: Ethernet frames from network processes to the
NIC must have an IP address and TCP port that
matches the origin of the network process.

SI 7: HTTP data from network processes to web page
instances must adhere to the SOP.

SI 8: Network processes for different web page instances
must remain isolated.


To help enforce these invariants, IBOS puts all network
processes in their own protection domains. If a web
page instance makes an HTTP request, the kernel will extract
the origin from the request message and either route
this request to an existing network process that has the 
same label, or it will create a new network process and 
label the network process with the origin of the HTTP 
request. Likewise, the kernel inspects incoming Ether- 
net frames to extract the origin and TCP port informa- 
tion, and routes these frames to the appropriately labeled 
network process. By putting network processes in their 
own protection domains, the kernel naturally ensures that 
network requests from web page instances and Ethernet 
frames from the NIC are routed to the correct network 
process (SI 4 and SI 5).

To ensure that the NIC sends outgoing Ethernet frames 
to the correct host, the IBOS kernel checks the IP address
and TCP port of every outgoing Ethernet frame against the
label of the sending network process before passing the
frame to the NIC (SI 6). Also, the IBOS kernel checks
cookies before passing them to the network process to 
ensure that all of the origin labels adhere to cookie stan- 
dards. By performing these checks, the IBOS kernel en- 
sures that the NIC sends outgoing network requests to 
the proper host and that the request can only include data 
that would be available to the server anyway. 

To enforce the SOP, the IBOS kernel inspects HTTP 
data before forwarding it to the appropriate web page 
instance and drops any HTML documents from differ- 
ent origins (SI 7). To inspect data, the kernel uses the 
content sniffing algorithm from Chrome [10] to identify 
HTML documents so the kernel can check to make sure 
that the origin of HTML documents and the origin of the 
web page instance match. This countermeasure prevents 
compromised web page instances from peering into the 
contents of a cross-origin HTML document, thus pre- 
venting the compromised web page instance from read- 
ing sensitive information included in the HTML docu- 
ment. 

To help isolate web page instances from each other, 
we also label network processes with the origin of the 
web page instance (SI 8). This second label is used only 
for network access control decisions and does not affect 
the cookie policy, which is predicated on the origin of 
the network request. To access network processes, the 
origin of the web page instance must match the origin of 
this second label. By using this second label, the IBOS 
kernel isolates network requests from different web page 
instances to the same origin. As a result of this isolation, 
a web page instance that is served a malicious network 
resource (e.g., a malicious ad [41]) that compromises a 
network process remains isolated from other web page 
instances. If an attacker can compromise a network pro- 
cess, IBOS limits the damage to the web page instance 
that included the malicious content. 
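
Two of these checks, SI 6 and SI 7, can be sketched as simple predicates. The sketch assumes that labels are (protocol, domain name, port) tuples and that the kernel has a trusted mapping from domain names to IP addresses, passed in here as the resolved argument. The function names and that mapping are hypothetical, and content sniffing is reduced to a boolean supplied by the caller.

```python
def outgoing_frame_allowed(proc_label, dst_ip, dst_port, resolved):
    """SI 6: the destination must match the network process's origin label."""
    _scheme, host, port = proc_label
    return dst_port == port and dst_ip in resolved.get(host, ())


def reply_allowed(page_origin, reply_origin, looks_like_html):
    """SI 7: drop HTML documents whose origin differs from the page's."""
    return (not looks_like_html) or reply_origin == page_origin
```

For example, a network process labeled for foo.com on port 80 may only emit frames to an address that foo.com resolves to, on port 80, and a cross-origin reply is forwarded only when it is not an HTML document.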


4.8 UI invariants


The three UI invariants that the IBOS kernel enforces are: 


SI 9: The browser chrome and web page content displays
are isolated.

SI 10: Only the current tab can access the screen,
mouse, and keyboard.

SI 11: The URL of the current tab is displayed to the
user.


The key mechanisms that our UI subsystem uses to 
provide isolation are to use a frame buffer video driver 
and page protections to isolate portions of the screen (SI 
9). Our video driver uses a section of memory, called 
a frame buffer, for writing to the screen. Processes 
Figure 3: IBOS display isolation. This figure shows how
IBOS divides the display into three main parts: a bar at 
the top for the kernel, a bar for browser chrome, and the 
rest for displaying web page content. The IBOS kernel 
enforces this isolation using page protections and without 
relying on a window manager. 





write pixel values to this frame buffer and the graph- 
ics card displays these pixels. Although our mechanism 
makes heavy use of the software rastering available in the Qt
Framework [3], our experience and anecdotal evidence
from the Qt developers show that software rastering can
perform roughly as fast as native X drivers running on 
Linux [7]. The key advantage of our approach is that 
the IBOS kernel can use standard page-protection mech- 
anisms to isolate portions of the screen. Although our 
current implementation does not support hardware accel- 
eration, we believe that our techniques will work because 
the IBOS kernel can interpose on standardized accelera- 
tion hardware/software interfaces, such as OpenGL and 
DirectX. 

To provide screen isolation, we divide up the screen 
into three horizontal portions (Figure 3). At the top, we 
reserve a small bar that only the IBOS kernel can access. 
We use the next section of the screen for the UI subsys- 
tem to draw the browser chrome. Finally, we provide 
the remainder of the screen to the web page instance. To 
ensure that only one web page instance can write to the 
screen at any given time, we only map the frame buffer 
memory region into the currently active web page in- 
stance and we only route mouse and keyboard events to 
this currently active web page instance (SI 10). 
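
To see why horizontal bands suit page protections, consider a minimal sketch under assumed parameters that are not taken from IBOS: a linear frame buffer, a 1024-pixel-wide display at 4 bytes per pixel, 4 KiB pages, and band heights of 16 and 64 rows. With these values each screen row occupies exactly one page, so every band is a contiguous, page-aligned range. The names region_pages and layout are hypothetical.

```python
PAGE_SIZE = 4096  # bytes; assumed page size


def region_pages(first_row, num_rows, width_px=1024, bytes_per_px=4):
    """Return the frame buffer pages covered by a horizontal band."""
    row_bytes = width_px * bytes_per_px
    start = first_row * row_bytes
    end = start + num_rows * row_bytes
    if start % PAGE_SIZE or end % PAGE_SIZE:
        # A band that ends mid-page would share a page with its neighbor.
        raise ValueError("region is not page-aligned")
    return range(start // PAGE_SIZE, end // PAGE_SIZE)


def layout(kernel_rows=16, chrome_rows=64, total_rows=768):
    """Split the screen into kernel bar, browser chrome, and content."""
    content_rows = total_rows - kernel_rows - chrome_rows
    return {
        "kernel": region_pages(0, kernel_rows),
        "chrome": region_pages(kernel_rows, chrome_rows),
        "content": region_pages(kernel_rows + chrome_rows, content_rows),
    }
```

In this model the kernel would map only the content range into the active web page instance, the chrome range into the UI subsystem, and keep the kernel range to itself.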

To switch tabs, the UI subsystem notifies the IBOS 
kernel about which tab is the current tab, and the IBOS 
kernel updates the frame buffer page table entries ap- 
propriately. However, a malicious UI manager could 
switch tabs arbitrarily and cause the address bar and the 
tab content to become out of sync (e.g., showing a page
from attacker.com while claiming that the page comes from
uiuc.edu). One alternative we considered for this UI
inconsistency was interposing on mouse and keyboard
clicks to infer which tab the user clicked on, and also 
performing optical character recognition on the address 
bar to determine the address that the UI manager is dis- 
playing. However, tracking this level of detail would re- 
quire far too much implementation specific information 
and would require the IBOS kernel to track additional 
events like a user switching the order of tabs. 

Our approach for the IBOS kernel is to use the kernel 
display area to display the URL for the currently visi- 
ble web page instance (SI 11). The kernel derives the 
URL from the label of the currently visible web page 
instance, providing high assurance that the URL the ker- 
nel displays matches the URL of the visible web page 
instance without tracking implementation specific states 
and events in the UI manager. Although this security in- 
variant appears simple, it is something that modern web 
browsers have had trouble getting right [13]. 


4.9 Web page instances and iframes 


The IBOS kernel creates a new web page instance each 
time a user clicks on a link or types a new URL in the 
address bar. To enforce the SOP on iframes, we run 
cross-origin iframes in separate web page instances. 
This separation allows us to fully track the SOP using 
kernel visible entities. To facilitate communication be- 
tween web page instances and the iframes that they 
host, we marshal postMessage calls between the two. 

Our current display isolation primitives are coarse 
grained and we rely on the web page instance to manage 
cross-origin iframe displays even though iframes 
run in separate protection domains. However, current 
display policies allow web page instances to draw over 
cross-origin iframes that they host, so this design deci- 
sion has no impact on current browser policies. One po- 
tential shortcoming of this display management approach 
is that compromised web page instances can read the dis- 
play data for embedded iframes. Fortunately, many 
sites with sensitive information, like facebook.com 
and gmail.com, use frame busting techniques [34] to 
prevent cross-origin sites from embedding them, which 
the IBOS kernel can enforce. 


4.10 Custom policies 


The main focus of this project is enforcing
current browser policies from the lowest layer of software.
However, we also want to create an architecture
that exposes enough browser states and events to en- 
able novel browser security policies. Attacks such as 
XSS operate within traditional browser policies and can 
be difficult to prevent without relying on the HTML or 
JavaScript engine implementations. Although our architecture
cannot prevent XSS, our goal is to prevent these
types of attacks from causing damage. 

One mechanism we implement in IBOS is to give 
a web server the ability to create its own more re- 
strictive security policy to prevent attacks from sending 
sensitive information to third-party hosts. In our cus- 
tom policy, we allow web sites to specify a server-side 
policy file that IBOS retrieves to restrict network accesses
for a web page instance, similar to Tahoma manifests [15].
For example, assume that a bank website
located at http://www.bank.com creates a policy
file at http://www.bank.com/.policy that specifies
that the online bank system can only access resources
from www.bank.com or data.bank.com. IBOS retrieves
the policy file and automatically applies a more
restrictive policy for the online bank web application. 
This restrictive policy prevents an attacker from sending 
stolen information to a third-party host, providing an ad- 
ditional layer of protection for the web application. 
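
The bank example can be sketched as a host allow-list check. The format of the policy file is not described here, so the sketch assumes one host name per line with optional "#" comments. The function names parse_policy and fetch_allowed are hypothetical.

```python
from urllib.parse import urlsplit


def parse_policy(text):
    """Parse an assumed policy format: one allowed host per line."""
    hosts = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip().lower()
        if line:
            hosts.add(line)
    return hosts


def fetch_allowed(allowed_hosts, url):
    """Return True if a network access targets a host the policy permits."""
    host = (urlsplit(url).hostname or "").lower()
    return host in allowed_hosts
```

With a policy listing www.bank.com and data.bank.com, a request that tries to carry stolen data to any other host is refused before it reaches a network process.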


5 Implementation 


The implementation of IBOS is divided into three parts: 
the IBOS kernel, IBOS message passing interfaces,
and IBOS subsystems. The IBOS kernel is implemented 
on top of the L4Ka::Pistachio microkernel and runs on 
X86-64 uniprocessor and SMP platforms. We modi- 
fied L4Ka to improve its support for SMP systems. The 
IBOS kernel schedules processes based on a static prior- 
ity scheduling algorithm. 

The IBOS kernel provides three basic APIs (i.e.,
send(), recv(), and poll()) to facilitate message
passing. Applications use send() and recv() for
communication and call poll() to wait for new messages.
The IBOS kernel intercepts all messages and automatically
extracts the semantics from them, like creating
a new web page instance or forwarding cookies to
network processes. Then the kernel inspects the seman- 
tics to make sure they conform to all security invariants 
and policies that we described in previous sections. 
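
A toy model of message passing with an interposed reference monitor is sketched below. Everything beyond the three call names is an assumption: the class MonitoredKernel, the role names, the message kinds, and the ALLOWED table are hypothetical, and poll() is non-blocking here for simplicity, whereas the real call waits for new messages.

```python
from collections import deque

# Hypothetical policy table:
# (sender role, receiver role) -> permitted message kinds.
ALLOWED = {
    ("web_page", "network"): {"fetchUrl"},
    ("network", "storage"): {"getCookie"},
    ("network", "web_page"): {"httpReply"},
}


class MonitoredKernel:
    """Toy kernel in which every message passes a policy check."""

    def __init__(self):
        self.queues = {}

    def send(self, src, dst, kind, body=None):
        """Queue a message only if the reference monitor permits it."""
        if kind not in ALLOWED.get((src, dst), set()):
            return False
        self.queues.setdefault(dst, deque()).append((src, kind, body))
        return True

    def poll(self, who):
        """Report whether a message is waiting (non-blocking in this sketch)."""
        return bool(self.queues.get(who))

    def recv(self, who):
        """Return the oldest pending message for `who`."""
        return self.queues[who].popleft()
```

In this model a cookie request from the UI subsystem never reaches the storage manager, which mirrors the example given for SI 0.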

The IBOS subsystems implement APIs for web
browsers and traditional applications. They are built on
top of an IBOS-specific uClibc [6] C library, the lwIP [17]
TCP/IP stack, and the Qt Framework [3]. The web
browser also uses an IBOS-specific WebKit [5] to parse 
and render web pages. 

To support traditional apps, we use our uClibc and Qt 
implementations to provide access to browser abstrac- 
tions using the UNIX-like abstractions of the C runtime, 
and GUI support from Qt. We use a few Qt sample pro- 
grams for testing and we implement one plugin. Our plu- 
gin is a PDF viewer that uses the Ghostscript PDF ren- 
dering engine with bindings for Qt. 





























System                                             LOC
IBOS                                            42,044
    IBOS Kernel                                  8,905
    L4Ka::Pistachio                             33,139
Firefox on Linux                           > 5,684,639
    Firefox 3.5                              2,171,267
    GTK+ 2.18                                  489,502
    glibc 2.11                                 740,314
    X.Org 7.5                                  653,276
    Linux kernel 2.6.31                      1,630,280
ChromeOS                                   > 4,407,066
    Chrome browser kernel 4.1.249              714,348
    GTK+ 2.18                                  489,502
    glibc 2.11                                 740,314
    ChromeOS kernel & services (May 2010)    2,462,902


Table 1: Estimation of LOC of TCBs for IBOS, Firefox 
on Linux, and ChromeOS. LOC counts are also shown 
for some major components that are included in the TCB. 





6 Evaluation 


This section describes our evaluation of IBOS. In our 
evaluation, we analyze the security of IBOS by measur- 
ing the number of lines of code (LOC) in the IBOS TCB 
and comparing it with other systems, and by looking at 
recent bugs in comparable systems and counting vulner- 
abilities that IBOS is susceptible to. We also revisit the 
example attacks we discussed in the introduction, and we 
measure the performance. 


6.1 TCB 


In IBOS, our goal is to minimize the TCB for web 
browsers and to simplify browser-based systems. To 
quantitatively evaluate our effort, we count the LOC in 
the IBOS TCB and compare it against the TCB for Fire- 
fox and ChromeOS. IBOS supports fewer hardware ar- 
chitectures, platforms, device drivers and features, such 
as browser extensions, than Firefox running on Linux 
and ChromeOS. For a fair comparison, we only count 
source code that is used for running above Linux and on 
the X86-64 platform. Also, we omit all device drivers 
from our counts except for the drivers we implement in 
IBOS. 
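
As a rough illustration of this comparison, the following Python sketch
sums the per-component counts reported in Table 1 and computes how many
times larger each comparison TCB is than the IBOS TCB. Only the LOC
figures come from Table 1; the dictionary layout and the function names
are illustrative assumptions, not part of IBOS or SLOCCount.

```python
# LOC figures copied from Table 1; the data layout is illustrative only.
TCB_COMPONENTS = {
    "IBOS": {
        "IBOS Kernel": 8_905,
        "L4Ka::Pistachio": 33_139,
    },
    "Firefox on Linux": {
        "Firefox 3.5": 2_171_267,
        "GTK+ 2.18": 489_502,
        "glibc 2.11": 740_314,
        "X.Org 7.5": 653_276,
        "Linux kernel 2.6.31": 1_630_280,
    },
    "ChromeOS": {
        "Chrome browser kernel 4.1.249": 714_348,
        "GTK+ 2.18": 489_502,
        "glibc 2.11": 740_314,
        "ChromeOS kernel & services (May 2010)": 2_462_902,
    },
}


def tcb_total(system: str) -> int:
    """Sum the LOC of all listed TCB components of one system."""
    return sum(TCB_COMPONENTS[system].values())


def ratio_to_ibos(system: str) -> float:
    """How many times larger the system's counted TCB is than IBOS's."""
    return tcb_total(system) / tcb_total("IBOS")


if __name__ == "__main__":
    for name in TCB_COMPONENTS:
        print(f"{name}: {tcb_total(name):,} LOC, "
              f"{ratio_to_ibos(name):.0f}x the IBOS TCB")
```

Because the Firefox and ChromeOS totals are lower bounds, the ratios
computed this way are lower bounds as well.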

Table 1 shows the LOC counts in the TCB for these three
systems, measured by SLOCCount [54]. For Firefox and
ChromeOS, our counts are conservative because we only
count the major components that make up the TCB for each
system; there are likely more components that are also in
the TCB for these systems. Because the IBOS TCB has only
around 42K LOC, it is possible to formally verify or
manually review the entire IBOS TCB. In fact, one L4-type
microkernel has already been formally verified [32].


Affected Component      Num.   Prevented
Linux kernel overall     21    20 (95%)
  File system            12    12 (100%)
  Network stack           5     5 (100%)
  Other                   4     3 (75%)
X Server                  2     2 (100%)
GTK+ & glibc              5     5 (100%)
Overall                  28    27 (96%)


Table 2: OS and library vulnerabilities. This table shows
the number of vulnerabilities that IBOS prevents.


6.2 OS and library vulnerabilities 


To evaluate the security impact of IBOS’s reduced TCB, 
we obtained a list of 74 vulnerabilities found in the Linux 
kernel, X Server, GTK+, and glibc this year so far (as 
of Sep. 18, 2010) [1] to see how the IBOS architecture 
handles them. Out of the 74 vulnerabilities, 20 are re- 
lated to unsupported hardware architectures and devices, 
and 26 cause denial-of-service, which is out-of-scope for 
this paper. For the remaining 28, we classify them based 
on the subsystem the vulnerability lies in to determine if 
IBOS is susceptible to these vulnerabilities. 

Table 2 shows IBOS is able to prevent 27 of 28 vul- 
nerabilities (96%). The only vulnerability we miss is 
a memory corruption vulnerability in the e1000 Ether- 
net driver. Normally IBOS is not susceptible to bugs in 
device drivers, but this particular bug resulted from the 
driver not accounting properly for Ethernet frames larger 
than 1500 bytes, and this type of logic is what our NIC 
verification state machine uses, so we counted this bug 
against IBOS. 
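
The filtering and tallying described above can be reproduced with a few
lines of Python. This is only a bookkeeping sketch: the counts are taken
from the text and Table 2, the network-stack count of 5 is the value
that makes the Linux kernel rows sum to 21, and all names are
illustrative assumptions.

```python
# Counts from the text of Section 6.2.
TOTAL_REPORTED = 74        # vulnerabilities obtained from [1]
UNSUPPORTED_HARDWARE = 20  # unsupported architectures and devices
DENIAL_OF_SERVICE = 26     # out of scope for the paper

# (examined, prevented) per affected component, from Table 2.
TABLE_2 = {
    "Linux kernel: file system": (12, 12),
    "Linux kernel: network stack": (5, 5),
    "Linux kernel: other": (4, 3),   # includes the e1000 driver bug
    "X Server": (2, 2),
    "GTK+ & glibc": (5, 5),
}


def in_scope(total: int, unsupported: int, dos: int) -> int:
    """Vulnerabilities left after removing out-of-scope reports."""
    return total - unsupported - dos


def summarize(table: dict) -> tuple:
    """Return (examined, prevented, percent prevented rounded)."""
    examined = sum(e for e, _ in table.values())
    prevented = sum(p for _, p in table.values())
    return examined, prevented, round(100 * prevented / examined)


if __name__ == "__main__":
    print(in_scope(TOTAL_REPORTED, UNSUPPORTED_HARDWARE, DENIAL_OF_SERVICE))
    print(summarize(TABLE_2))
```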


6.3 Browser vulnerabilities


To evaluate security improvements that IBOS makes 
for browsers themselves, we compared how well 
IBOS could contain or prevent vulnerabilities found in 
Google’s Chrome browser. For this evaluation, we ob- 
tained a list of 295 publicly visible bugs with the “se- 
curity” label in Chrome’s bug tracker. Out of the 295 
bugs, 42 cause denial-of-service such as a simple crash or 
100% CPU utilization. IBOS does not address denial-of- 
service or resource management currently. An additional 
78 are either invalid, duplicate, not actually security is- 
sues, or related to features that IBOS does not have, such 
as browser extensions. For the remaining 175 bugs, we 
examined each of them to the best of our knowledge and 
classified them into the following seven categories and 
compared how Chrome and IBOS handle those cases: 

Memory exploitation: an attacker could use a memory
corruption bug to deploy a remote code execution attack.
For Chrome, if the bug is in its rendering engine, Chrome
contains the attack. However, bugs in the browser kernel
give attackers access to the entire browser. For IBOS,
bugs in either the rendering engine or other service
components are contained as they are all out of the TCB.

XSS: browsers rely on careful sanitization and correct 
processing of different encodings to prevent XSS attacks. 
For both Chrome and IBOS, it is infeasible to eliminate 
XSS attacks, but they both contain the attacks in the af- 
fected web apps. 

SOP circumvention: Chrome runs contents in frames
from different origins in a single address space and uses
scattered “if” and “else” statements to enforce the
same-origin policy. This logic can sometimes be subverted.
In IBOS, we run iframes in different web page instances to
provide strong isolation and check cross-origin access in
the IBOS kernel.

Sandbox bypassing: Chrome uses sandboxing techniques,
such as SELinux, to limit the rendering engine’s
authority. However, rule-based sandboxing is complex
and can be bypassed in some scenarios. In IBOS, we
designed browser abstractions to restrict the authority of
each subsystem, and these abstractions are naturally
immune to this kind of problem.

Interface spoofing: browsers are sometimes vulnerable
to visual attacks in which a malicious website can use
complex HTTP redirection or even replicate the “look
and feel” of victim websites to deploy phishing. Chrome
uses a blacklist-based filter to warn users of malicious
websites. In IBOS, the IBOS kernel separates the display
of different web page instances and uses the labels
of web page instances to display the correct URL at the
top of the screen to give the user a visual cue of which
website he or she is visiting.

UI design flaw: some security concerns arise because 
of careless implementation, such as showing users’ pass- 
words in plain text. Both Chrome and IBOS are vulnera- 
ble to this type of problem. 

Misc: some vulnerabilities could not easily be classi- 
fied and mostly have low security severity. This is the 
category for those remaining bugs. 

In Table 3, we show the detailed results of the analysis 
of the 175 vulnerabilities, broken down by the classifi- 
cations above. We examined each of them to determine 
whether Chrome contains the threats in the affected com- 
ponents, and whether IBOS contains or eliminates the at- 
tacks. The table shows IBOS successfully protects users 
from 135 of the 175 vulnerabilities (77%). 

                                                                                  Chrome      IBOS
Category             Example                                                Num.  Contained   Contained or eliminated
Memory exploitation  A bug in layout engine leads to remote code execution    82  71 (86%)     79 (96%)
XSS                  XSS issue due to the lack of support for ISO-2022-KR     14  12 (87%)     14 (100%)
SOP circumvention    XMLHttpRequest allows loading from another origin        21   0 (0%)      21 (100%)
Sandbox bypassing    Sandbox bypassing due to directory traversal             12   0 (0%)      12 (100%)
Interface spoofing   Two pages merge together in certain situation             6   0 (0%)       6 (100%)
UI design flaw       Plain-text information leak due to autosuggest           17   0 (0%)       0 (0%)
Misc                 Geolocation events fire after document deletion          22   0 (0%)       3 (14%)
Overall                                                                      175  83 (46%)    135 (77%)


Table 3: Browser vulnerabilities. This table shows the number of Chrome vulnerabilities that Chrome itself contains
and IBOS contains or eliminates.


The largest portion of bugs are browser implementation
flaws that cause memory corruption and allow remote code
execution. Chrome does a fairly good job containing most
of them when they are in the rendering engine. However,
Chrome is unable to contain exploits in the browser
kernel. A good example is a bug in the HTTP chunked
encoding module in the browser kernel, which opens the
possibility for a remote attacker to inject code. In IBOS,
the TCP/IP and HTTP stack is pushed out of the TCB, and
is replicated and isolated according to browser security
policies. Thus, IBOS is able to contain this bug. The
three memory corruption bugs IBOS could not contain
stemmed from Chrome’s message passing system. Because
the IBOS message passing logic resides within our TCB,
we counted these as bugs that IBOS would have missed.


6.4 Motivation revisited 


In the introduction, we listed some examples of attacks
that can still cause damage to modern secure web
browsers by exploiting code in their TCB. We revisit
these examples to argue that IBOS can prevent them.

A compromised Ethernet driver cannot access the 
DMA buffers used by the device. Even if an attacker 
exploits the Ethernet driver, he or she still cannot tamper 
with network packets because the driver does not have 
access to DMA buffers and because the IBOS kernel val- 
idates all transmit and receive buffers that the driver sets. 

A compromised storage module has little impact on 
data confidentiality and integrity. The IBOS kernel en- 
crypts all data with secret keys that only the IBOS ker- 
nel has access to. Stored objects are tagged with a hash 
and origin information so that the IBOS kernel is able 
to detect tampered data. The only thing a compromised 
storage module can do is delete objects. 
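
To make the tamper-detection idea concrete, here is a minimal Python
sketch of tagging a stored object with its origin and a hash. It is not
the IBOS implementation: the choice of a keyed hash (HMAC-SHA256), the
record layout, and the names `seal` and `verify` are all assumptions
made for illustration, and the encryption step is omitted to keep the
example within the Python standard library.

```python
import hashlib
import hmac
import os


def _tag(key: bytes, origin: str, data: bytes) -> bytes:
    """Keyed hash over the length-prefixed origin followed by the data."""
    origin_bytes = origin.encode("utf-8")
    message = len(origin_bytes).to_bytes(4, "big") + origin_bytes + data
    return hmac.new(key, message, hashlib.sha256).digest()


def seal(key: bytes, origin: str, data: bytes) -> dict:
    """Build the record a trusted kernel would hand to untrusted storage."""
    return {"origin": origin, "data": data, "tag": _tag(key, origin, data)}


def verify(key: bytes, record: dict) -> bool:
    """Recompute the tag on read; changed data or origin fails the check."""
    expected = _tag(key, record["origin"], record["data"])
    return hmac.compare_digest(expected, record["tag"])


if __name__ == "__main__":
    kernel_key = os.urandom(32)  # hypothetical kernel-only secret
    record = seal(kernel_key, "http://example.com", b"cookie=abc")
    print(verify(kernel_key, record))        # untouched record
    tampered = dict(record, data=b"cookie=evil")
    print(verify(kernel_key, tampered))      # modified by the storage module
```

Under these assumptions the storage module can still drop a record, but
it cannot alter the data or relabel it with a different origin without
the check failing.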

A compromised network stack is constrained as well.
In IBOS, every network process runs a complete network
stack. A compromised network process cannot send users’
data to a third party host as the IBOS kernel ensures it
can only communicate with the expected host. Network
processes do have the ability to modify or replay HTTP
requests, but the web server might have a mechanism to
defend against replay attacks.
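
The destination restriction described above can be pictured as a simple
check performed outside the network process. The Python sketch below is
hypothetical: the `NetProcLabel` and `OutgoingPacket` types, their
fields, and the `kernel_allows` function are invented for illustration
and do not reflect the actual IBOS kernel interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetProcLabel:
    """Hypothetical label naming the one host a network process may contact."""
    host_ip: str
    port: int


@dataclass(frozen=True)
class OutgoingPacket:
    """Hypothetical view of the header fields the kernel would inspect."""
    dst_ip: str
    dst_port: int
    payload: bytes


def kernel_allows(label: NetProcLabel, packet: OutgoingPacket) -> bool:
    """Permit a packet only if its destination matches the process label."""
    return packet.dst_ip == label.host_ip and packet.dst_port == label.port


if __name__ == "__main__":
    label = NetProcLabel(host_ip="192.0.2.10", port=80)
    ok = OutgoingPacket("192.0.2.10", 80, b"GET / HTTP/1.1\r\n\r\n")
    leak = OutgoingPacket("198.51.100.7", 80, b"stolen data")
    print(kernel_allows(label, ok))
    print(kernel_allows(label, leak))
```

Note that the check looks only at the destination, not the payload,
which is consistent with the observation that a compromised network
process can still modify or replay requests to its own host.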

A compromised window manager cannot affect other
subsystems in IBOS. In IBOS, the role of the window
manager is simplified to only draw the browser chrome. It
can change some potentially sensitive information, such
as web page titles. However, the IBOS kernel displays the
URL of the current tab in the kernel display area,
providing users with some visual cues as to the provenance
of the displayed web content.


6.5 Performance 


To evaluate the performance implication of IBOS’s ar- 
chitecture, we compare its browsing experience to other 
web browsers running in Linux. All experiments were 
carried out on a 2.33GHz Intel Core 2 Quad CPU 
Q8200 with 4GB of memory, a 320GB 7200RPM Sea- 
gate ST3320613 SATA hard drive and an Intel PRO/1000 
NIC connected to 1000 Mbps Ethernet. For Linux, we
used Ubuntu 9.10 with kernel version 2.6.31-16-generic
(x86-64).

Figure 4: Page load latencies for IBOS and other web
browsers. All latencies are shown in milliseconds.


We use page load latency to represent browsing
experience. Page load latency is defined as the elapsed
time between the initial URL request and the DOM onload
event. We compare IBOS with Firefox 3.5.9 and Chrome
for Linux 4.1.249. We also ported most of the IBOS
browser components to the Linux platform (noted as
IBOS-Linux) to focus on the performance impact of our
IBOS kernel architecture. In IBOS, we statically allocate
processors for subsystems as follows: the kernel and
device drivers run on CPU0, network processes run on
CPU1, web page instances run on CPU2, and all other
components run on CPU3. IBOS, IBOS-Linux, and Chrome all
use the same version of WebKit from February 2010 with
just-in-time JavaScript compilation and HTTP pipelining
enabled. For the WebKit-based browsers, we instrument
them to measure the time between the initial URL request
and the DOM onload event. For Firefox, we use an
extension that measures these same events. To reduce
noise introduced by our network connection, we load
each web site using a fresh web page/browser instance
with an empty cache 15 times and report the average of
the five shortest page load latency times.
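
The reporting rule described above (15 loads per site, average of the
five shortest) can be written down in a few lines. The Python sketch
below only illustrates that rule; the function name and parameters are
assumptions, and the sample values are made up rather than measured.

```python
from statistics import mean


def reported_latency(samples_ms, runs=15, keep=5):
    """Average of the `keep` shortest latencies out of exactly `runs` loads."""
    if len(samples_ms) != runs:
        raise ValueError(f"expected {runs} samples, got {len(samples_ms)}")
    return mean(sorted(samples_ms)[:keep])


if __name__ == "__main__":
    # Made-up page load latencies in milliseconds for one site.
    samples = [150, 120, 130, 110, 100, 500, 400, 300,
               250, 200, 180, 170, 160, 140, 900]
    print(reported_latency(samples))
```

Keeping only the shortest runs discards loads that were slowed by
transient network delays, which is the stated purpose of the procedure.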

In Figure 4, we present the page load latency times
for six popular websites and show the standard deviations
with the error bars. Overall, Chrome has the shortest
page load latencies due to its effective optimization
techniques. For maps.google.com, IBOS, IBOS-Linux, and
Chrome outperform Firefox, possibly due to optimization
in the WebKit engine for this particular site. For
www.bing.com, sfbay.craigslist.org, and cs.illinois.edu,
IBOS, IBOS-Linux, and Firefox show roughly the same
results. IBOS has the fastest loading time for
craigslist. Craigslist is a simple web site with few
HTTP requests and with a large number of HTML elements.
We hypothesize that the small performance improvement is
due to the simplified IBOS software stack.

Both en.wikipedia.org/wiki/Main_Page and
www.facebook.com have more HTTP requests than any of the
other sites, and we observe slower page load latencies
for IBOS than for other browsers. For these experiments
IBOS performs slower than IBOS-Linux. Because we use the
IBOS components in Linux, we believe that this
performance difference stems from overhead in the IBOS
kernel. To test this hypothesis, we ran a number of
microbenchmarks on the two systems, and we believe that
the overhead is due to contention for spinlocks in the
L4 IPC implementation. The net effect of this contention
is that heavy use of network processes requires heavy
use of IPC, which adds latency to all IPC messages and
slows down the overall system. However, the IBOS-Linux
results for these experiments show that this slowdown is
not fundamental and can be fixed with a more mature
kernel implementation.

Overall, the page load latency experiments show that
even with a prototype implementation of IBOS, our
architecture will not slow down the browsing speed
significantly for the web sites we tested.


7 Additional related work 


7.1 Alternative kernel architectures 


Operating systems designed to reduce the trusted com- 
puting base for applications are not new. For example, 
several recent OSes propose using information flow to 
allow applications to specify information flow policies 
that are enforced by a thin kernel [18, 57, 33]; KeyKOS
[12], EROS [45], and seL4 [32] provide capability support
using a small kernel; and microkernels [24, 27, 28]
push typical OS components into user space. In IBOS,
we apply these principles to a new application — the web 
browser — and include support for user interface com- 
ponents and window manager operations. Also, these 
previous approaches support general purpose security 
mechanisms, like information flow and capabilities, and 
shared resources and device drivers are part of the TCB. 
The IBOS security policy is specific to web browsers, 
and although this is less general, we can track this pol- 
icy to hardware abstractions and can remove drivers and 
other shared components from our TCB. 

Both Exokernels [19, 31] and L4 [27] rethink low-layer
software abstractions. Both projects advocate exposing
abstractions that are close to the underlying hardware
to enable applications to customize for improved
performance. In IBOS we build on these previous works; in
fact, we use the L4Ka::Pistachio L4 [8] MMU abstractions
and message passing implementation directly. However,
the key difference between our work and L4 and Exokernel
is that we expose high-level application abstractions at
our lowest layer of software, not low-level hardware
abstractions. Our focus is on making web browsers more
secure and on the system software we use to accomplish
this improved security.


7.2 Browser security 


A number of recent papers have proposed new browser 
architectures including SubOS [29, 30], safe web pro- 
grams [44], OP [26], Chrome [11, 43], Gazelle [52], and 
ServiceOS [38]. Although the browser portion of IBOS 
does resemble some of these works, they all run on top of 
commodity OSes and include complex libraries and win- 
dow managers in their TCB, something that IBOS avoids 
by focusing on the OS architecture of our system. 

The webOS from Palm [40] and the upcoming
ChromeOS from Google [25] run a web browser on top
of a Linux kernel. ChromeOS includes kernel hardening
using trusted boot, mandatory access controls, and
sandboxing mechanisms for reducing the attack surface
of their system. However, ChromeOS and IBOS have
fundamentally different design philosophies. ChromeOS
starts with a large and complex system and tries to
remove and restrict the unused and unneeded portions of
the system. In contrast, IBOS starts with a clean slate
and adds only the functionality needed for our browser.
Although our approach does require implementing low-level
software from scratch and fitting device drivers to a
new driver model, the end result has 2 to 3 orders of
magnitude fewer lines of code in the TCB, while still
retaining nearly all of the same functionality.


In the Tahoma browser [15], the authors propose using 
virtual machine monitors (VMMs) to enable web appli- 
cations to specify code that runs on the client. Tahoma 
uses server-side manifests to specify the security pol- 
icy for the downloaded code and the VMM enforces 
this security policy. Tahoma does expose a few browser 
abstractions from their VMM to help manage UI ele- 
ments and network connections, but operates mostly on 
hardware-level abstractions. Because Tahoma operates 
on hardware-level abstractions, Tahoma is unable to
provide full backwards-compatible web semantics from the
VMM and more fine-grained protection for browsers,
such as isolating iframes embedded in a web application.
Also, many modern VMMs use a full-blown commodity OS in
a privileged virtual machine or host OS for driver
support, potentially leaving tens of millions of lines
of code in the TCB.


7.3 Device driver security 


Device driver security has focused on three main topics. 
First, several projects focus on restricting driver access to 
I/O ports and device access to main memory via DMA. 
For example, RVM uses a software-only approach to re- 
strict DMA access of devices [55], SVA prevents the OS 
from accessing driver registers via memory mapped I/O 
through memory safety checks [16], and Mungi [35] re- 
lies on using a hardware IOMMU to limit which mem- 
ory regions are accessible from devices. Second, sys- 
tem designers isolate drivers from the rest of the system. 
This isolation can be achieved by running drivers in
user mode, which has been a staple of microkernel systems
[24, 36, 28], using software to protect the OS from
kernel drivers [20, 58], or by using page table
protections within the OS [49, 48]. The driver security
architecture in IBOS differs from these approaches
because our system provides fine-grained protection for
individual requests within a shared driver in addition to
isolating the driver from the rest of the system.


7.4 Secure window managers 


A number of recent projects have looked at reducing the 
TCB for window managers. For example, DOpE [21] and
Nitpicker [22] move widget rendering from the server 
to the client, leaving the server to only manage shared 
buffers. CMW [56], EWS [46], and TrustGraph [39] also 
use clients for rendering, but are able to apply capabili- 
ties and mandatory access control policies to application 
user-interface elements. In IBOS, we deprecate the gen- 
eral window notion of modern computer systems in favor 
of the simpler browser chrome and tab motif, allowing 
us to track our security policies down to the underlying 
graphics hardware on our system. 


8 Conclusions 


In this paper, we presented IBOS, an operating system 
and web browser co-designed to reduce drastically the 
trusted computing base for web browsers and to sim- 
plify browsing systems. To achieve this improvement, 
we built IBOS with browser abstractions as first-class OS 
abstractions and removed traditional shared system com- 
ponents and services from its TCB. With our new archi- 
tecture, we showed that IBOS enforced traditional and 
novel security policies, and we argued that the overall 
system security and usability could withstand successful 
attacks on device drivers, browser components, or tradi- 
tional applications. Our experimental results showed that 
IBOS added little overhead when compared to today’s 
high-performance browsers running on fast and mature 
commodity operating systems. 


Acknowledgment 


We would like to thank Brad Chen, Steve Gribble, and 
Hank Levy for their feedback on our security analy- 
sis. We would also like to thank our shepherd Nickolai 
Zeldovich, Anthony Cozzie, and Matt Hicks who pro- 
vided valuable feedback on this paper. This research 
was funded in part by NSF grants CNS 0834738 and 
CNS 0831212, grant N0014-09-1-0743 from the Office
of Naval Research, AFOSR MURI grant FA9550-09-01- 
0539, and by a grant from the Internet Services Research 
Center (ISRC) of Microsoft Research. 


References 


[1] CVE - Common Vulnerabilities and Exposures (CVE).
http://cve.mitre.org.

[2] Gecko plugin API reference.
https://developer.mozilla.org/en/Gecko_Plugin_API_Reference.

[3] Qt - A Cross-platform application and UI.
http://qt.nokia.com/.









[4] Symantec internet security threat report april 2010.
http://www.symantec.com/business/theme.jsp?themeid=threatreport.

[5] The WebKit Open Source Project. http://webkit.org/.

[6] uClibc. http://www.uclibc.org/.

[7] Qt labs blogs: So long and thanks for the blit, 2008.
http://labs.trolltech.com/blogs/2008/10/22/so-long-and-thanks-for-the-blit/.

[8] L4Ka::Pistachio microkernel, 2010.
http://l4ka.org/projects/pistachio.

[9] ANDERSON, J. P. Computer security technology planning study.
Tech. rep., HQ Electronic Systems Division (AFSC), October
1972. ESD-TR-73-51.

[10] BARTH, A., CABALLERO, J., AND SONG, D. Secure content
sniffing for web browsers or how to stop papers from reviewing
themselves. In Proceedings of the IEEE Symposium on Security
and Privacy (May 2009).

[11] BARTH, A., JACKSON, C., REIS, C., AND THE GOOGLE CHROME
TEAM. The security architecture of the chromium browser, 2008.
http://crypto.stanford.edu/websec/chromium/chromium-security-architecture.pdf.

[12] BOMBERGER, A. C., FRANTZ, W. S., HARDY, A. C., HARDY,
N., LANDAU, C. R., AND SHAPIRO, J. S. The KeyKOS nanokernel
architecture. In Proceedings of the Workshop on Micro-kernels
and Other Kernel Architectures (Berkeley, CA, USA, 1992),
USENIX Association, pp. 95-112.

[13] CHEN, S., MESEGUER, J., SASSE, R., WANG, H. J., AND
WANG, Y.-M. A systematic approach to uncover security flaws
in GUI logic. In Proceedings of the 2007 IEEE Symposium on
Security and Privacy (May 2007), pp. 71-85.

[14] CHEN, S., ROSS, D., AND WANG, Y.-M. An analysis of
browser domain-isolation bugs and a light-weight transparent
defense mechanism. In Proceedings of the 14th ACM Conference on
Computer and Communications Security (CCS) (2007), pp. 2-11.

[15] COX, R. S., HANSEN, J. G., GRIBBLE, S. D., AND LEVY,
H. M. A safety-oriented platform for web applications. In
Proceedings of the 2006 IEEE Symposium on Security and Privacy
(May 2006), pp. 350-364.

[16] CRISWELL, J., GEOFFRAY, N., AND ADVE, V. Memory safety
for low-level software/hardware interactions. In Proceedings of
the Eighteenth Usenix Security Symposium (August 2009).

[17] DUNKELS, A., WOESTENBERG, L., MANSLEY, K., AND MONOSES, J.
lwIP embedded TCP/IP stack.
http://savannah.nongnu.org/projects/lwip/, 2004.

[18] EFSTATHOPOULOS, P., KROHN, M., VANDEBOGART, S.,
FREY, C., ZIEGLER, D., KOHLER, E., MAZIÈRES, D.,
KAASHOEK, F., AND MORRIS, R. Labels and event processes
in the asbestos operating system. In SOSP ’05: Proceedings of
the Twentieth ACM Symposium on Operating Systems Principles
(New York, NY, USA, 2005), ACM, pp. 17-30.

[19] ENGLER, D. R., KAASHOEK, M. F., AND O’TOOLE, J., JR.
Exokernel: an operating system architecture for
application-level resource management. In Proceedings of the
1995 Symposium on Operating Systems Principles (December 1995),
pp. 251-266.

[20] ERLINGSSON, U., ABADI, M., VRABLE, M., BUDIU, M., AND
NECULA, G. C. Xfi: software guards for system address spaces.
In OSDI ’06: Proceedings of the 7th symposium on Operating
systems design and implementation (Berkeley, CA, USA, 2006),
USENIX Association, pp. 75-88.

[21] FESKE, N., AND HÄRTIG, H. DOpE - a window server for
real-time and embedded systems. In RTSS ’03: Proceedings of the
24th IEEE International Real-Time Systems Symposium
(Washington, DC, USA, 2003), IEEE Computer Society, p. 74.

[22] FESKE, N., AND HELMUTH, C. A Nitpicker’s guide to a
minimal-complexity secure GUI. In ACSAC ’05: Proceedings
of the 21st Annual Computer Security Applications Conference
(Washington, DC, USA, 2005), IEEE Computer Society, pp. 85-94.

[23] GARFINKEL, T. Traps and Pitfalls: Practical Problems in
System Call Interposition Based Security Tools. In Proceedings
of the 2003 Network and Distributed System Security Symposium
(NDSS) (February 2003).

[24] GOLUB, D., DEAN, R., FORIN, A., AND RASHID, R. Unix
as an Application Program. In Proceedings of the 1990 USENIX
Summer Conference (1990).

[25] GOOGLE INC. Chromium OS, 2010.
http://www.chromium.org/chromium-os.

[26] GRIER, C., TANG, S., AND KING, S. T. Secure web browsing
with the OP web browser. In Proceedings of the 2008 IEEE
Symposium on Security and Privacy (May 2008), pp. 402-416.

[27] HÄRTIG, H., HOHMUTH, M., LIEDTKE, J., WOLTER, J., AND
SCHÖNBERG, S. The performance of µ-kernel-based systems. In
SOSP ’97: Proceedings of the sixteenth ACM Symposium on
Operating Systems Principles (New York, NY, USA, 1997), ACM,
pp. 66-77.

[28] HERDER, J. N., BOS, H., GRAS, B., HOMBURG, P., AND
TANENBAUM, A. S. MINIX 3: a highly reliable, self-repairing
operating system. SIGOPS Oper. Syst. Rev. 40, 3 (2006), 80-89.

[29] IOANNIDIS, S., AND BELLOVIN, S. M. Building a secure web
browser. In Proceedings of the FREENIX Track: 2001 USENIX
Annual Technical Conference (June 2001).

[30] IOANNIDIS, S., BELLOVIN, S. M., AND SMITH, J.
Sub-operating systems: A new approach to application security.
In SIGOPS European Workshop (September 2002).

[31] KAASHOEK, M. F., ENGLER, D. R., GANGER, G. R.,
BRICEÑO, H. M., HUNT, R., MAZIÈRES, D., PINCKNEY, T.,
GRIMM, R., JANNOTTI, J., AND MACKENZIE, K. Application
performance and flexibility on exokernel systems. In SOSP ’97:
Proceedings of the sixteenth ACM symposium on Operating
systems principles (New York, NY, USA, 1997), ACM, pp. 52-65.

[32] KLEIN, G., ELPHINSTONE, K., HEISER, G., ANDRONICK, J.,
COCK, D., DERRIN, P., ELKADUWE, D., ENGELHARDT, K.,
KOLANSKI, R., NORRISH, M., SEWELL, T., TUCH, H., AND
WINWOOD, S. seL4: formal verification of an os kernel. In
SOSP ’09: Proceedings of the ACM SIGOPS 22nd symposium
on Operating systems principles (New York, NY, USA, 2009),
ACM, pp. 207-220.

[33] KROHN, M., YIP, A., BRODSKY, M., CLIFFER, N.,
KAASHOEK, M. F., KOHLER, E., AND MORRIS, R. Information
flow control for standard OS abstractions. In SOSP ’07:
Proceedings of twenty-first ACM Symposium on Operating Systems
Principles (New York, NY, USA, 2007), ACM, pp. 321-334.

[34] LAWRENCE, E. Combating clickjacking with
x-frame-options, March 2010.
http://blogs.msdn.com/b/ieinternals/archive/2010/03/30/combating-clickjacking-with-x-frame-options.aspx.

[35] LESLIE, B., AND HEISER, G. Towards untrusted device drivers.
Tech. rep., UNSW-CSE-TR-0303, 2003.

[36] LEVASSEUR, J., UHLIG, V., STOESS, J., AND GÖTZ, S.
Unmodified Device Driver Reuse and Improved System
Dependability via Virtual Machines. In Proceedings of the 2004
Symposium on Operating Systems Design and Implementation (OSDI)
(December 2004).

[37] MOSHCHUK, A., BRAGIN, T., GRIBBLE, S. D., AND LEVY,
H. M. A crawler-based study of spyware on the web. In
Proceedings of the 2006 Network and Distributed System Security
Symposium (NDSS) (February 2006).

[38] MOSHCHUK, A., AND WANG, H. J. Resource Management for
Web Applications in ServiceOS. Tech. rep., Microsoft Research,
May 2010.

[39] OKHRAVI, H., AND NICOL, D. M. Trustgraph: Trusted graphics
subsystem for high assurance systems. In ACSAC ’09:
Proceedings of the 2009 Annual Computer Security Applications
Conference (Washington, DC, USA, 2009), IEEE Computer Society,
pp. 254-265.

[40] PALM INC. webOS, 2010. http://opensource.palm.com.

[41] PROVOS, N., MAVROMMATIS, P., RAJAB, M. A., AND
MONROSE, F. All your iFRAMEs point to us. In Proceedings of the
17th Usenix Security Symposium (July 2008), pp. 1-15.

[42] PROVOS, N., MCNAMEE, D., MAVROMMATIS, P., WANG, K.,
AND MODADUGU, N. The ghost in the browser: Analysis of
Web-based malware. In Proceedings of the 2007 Workshop on
Hot Topics in Understanding Botnets (HotBots) (April 2007).

[43] REIS, C., AND GRIBBLE, S. D. Isolating web programs in
modern browser architectures. In Proceedings of the 2009
EuroSys conference (2009).

[44] REIS, C., GRIBBLE, S. D., AND LEVY, H. M. Architectural
principles for safe web programs. In Proceedings of the
Sixth Workshop on Hot Topics in Networks (HotNets) (November
2007).

[45] SHAPIRO, J. S., SMITH, J. M., AND FARBER, D. J. EROS:
a fast capability system. In SOSP ’99: Proceedings of the
seventeenth ACM symposium on Operating systems principles (New
York, NY, USA, 1999), ACM, pp. 170-185.

[46] SHAPIRO, J. S., VANDERBURGH, J., NORTHUP, E., AND
CHIZMADIA, D. Design of the EROS trusted window system. In
Proceedings of the 13th conference on USENIX Security Symposium
(Berkeley, CA, USA, 2004), USENIX Association, pp. 12-12.

[47] SINGH, K., MOSHCHUK, A., WANG, H. J., AND LEE, W. On
the incoherencies in web browser access control policies. In
Proceedings of the IEEE Symposium on Security and Privacy (May
2010).

[48] SWIFT, M. M., ANNAMALAI, M., BERSHAD, B. N., AND
LEVY, H. M. Recovering Device Drivers. In Proceedings of
the 2004 Symposium on Operating Systems Design and
Implementation (OSDI) (December 2004).

[49] SWIFT, M. M., BERSHAD, B. N., AND LEVY, H. M. Improving
the reliability of commodity operating systems. In SOSP ’03:
Proceedings of the nineteenth ACM symposium on Operating
systems principles (New York, NY, USA, 2003), ACM, pp. 207-222.

[50] SYMANTEC INC. Symantec global Internet security threat
report: Trends for 2008, April 2009.
http://www.symantec.com/business/theme.jsp?themeid=threatreport.

[51] TAN, L., ZHANG, X., MA, X., XIONG, W., AND ZHOU, Y.
AutoISES: Automatically inferring security specifications and
detecting violations. In Proceedings of the 17th USENIX Security
Symposium (USENIX Security ’08) (July-August 2008).

[52] WANG, H. J., GRIER, C., MOSHCHUK, A., KING, S. T.,
CHOUDHURY, P., AND VENTER, H. The multi-principal OS
construction of the Gazelle web browser. In Proceedings of the
2009 USENIX Security Symposium (August 2009).

[53] WANG, Y.-M., BECK, D., JIANG, X., ROUSSEV, R.,
VERBOWSKI, C., CHEN, S., AND KING, S. Automated Web Patrol
with Strider HoneyMonkeys: Finding Web sites that exploit
browser vulnerabilities. In Proceedings of the 2006 Network and
Distributed System Security Symposium (NDSS) (February 2006).

[54] WHEELER, D. SLOCcount, 2009.
http://www.dwheeler.com/sloccount/.

[55] WILLIAMS, D., REYNOLDS, P., WALSH, K., SIRER, E. G.,
AND SCHNEIDER, F. B. Device driver safety through a reference
validation mechanism. In OSDI ’08: Proceedings of the 8th
symposium on operating systems design and implementation (2008).

[56] WOODWARD, J. P. Security requirements for systems high and
compartmented mode workstations. Tech. rep., MITRE Corp.,
1987. MTR 9992.

[57] ZELDOVICH, N., BOYD-WICKIZER, S., KOHLER, E., AND
MAZIÈRES, D. Making information flow explicit in HiStar.
In OSDI ’06: Proceedings of the 7th symposium on Operating
systems design and implementation (Berkeley, CA, USA, 2006),
USENIX Association, pp. 263-278.

[58] ZHOU, F., CONDIT, J., ANDERSON, Z., BAGRAK, I.,
ENNALS, R., HARREN, M., NECULA, G., AND BREWER, E.
Safedrive: safe and recoverable extensions using language-based
techniques. In OSDI ’06: Proceedings of the 7th symposium
on Operating systems design and implementation (Berkeley, CA,
USA, 2006), USENIX Association, pp. 45-60.


9th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’10)


USENIX Association 


FlexSC: Flexible System Call Scheduling with Exception-Less System Calls 


Livio Soares 
University of Toronto 


Abstract 


For the past 30+ years, system calls have been the de facto 
interface used by applications to request services from the 
operating system kernel. System calls have almost uni- 
versally been implemented as a synchronous mechanism, 
where a special processor instruction is used to yield user- 
space execution to the kernel. In the first part of this 
paper, we evaluate the performance impact of traditional 
synchronous system calls on system intensive workloads. 
We show that synchronous system calls negatively affect 
performance in a significant way, primarily because of 
pipeline flushing and pollution of key processor structures 
(e.g., TLB, data and instruction caches, etc.). 

We propose a new mechanism for applications to 
request services from the operating system kernel: 
exception-less system calls. They improve processor efficiency
by enabling flexibility in the scheduling of operating
system work, which in turn can lead to significantly increased
temporal and spatial locality of execution in both
user and kernel space, thus reducing pollution effects on 
processor structures. Exception-less system calls are par- 
ticularly effective on multicore processors. They primar- 
ily target highly threaded server applications, such as Web 
servers and database servers. 

We present FlexSC, an implementation of exception- 
less system calls in the Linux kernel, and an accompany- 
ing user-mode thread package (FlexSC-Threads), binary 
compatible with POSIX threads, that translates legacy 
synchronous system calls into exception-less ones trans- 
parently to applications. We show how FlexSC improves 
performance of Apache by up to 116%, MySQL by up to 
40%, and BIND by up to 105% while requiring no modi- 
fications to the applications. 


1 Introduction 


System calls are the de facto interface to the operating system
kernel. They are used to request services offered by,
and implemented in, the operating system kernel. While
different operating systems offer a variety of different services,
the basic underlying system call mechanism has
been common on all commercial multiprocessed operating
systems for decades. System call invocation typically
involves writing arguments to appropriate registers and
then issuing a special machine instruction that raises a
synchronous exception, immediately yielding user-mode
execution to a kernel-mode exception handler. Two important
properties of the traditional system call design are
that: (1) a processor exception is used to communicate
with the kernel, and (2) a synchronous execution model is
enforced, as the application expects the completion of the
system call before resuming user-mode execution. Both of
these properties result in performance inefficiencies on modern
processors.


Figure 1: User-mode instructions per cycle (IPC) of Xalan
(from SPEC CPU 2006) in response to a system call exception
event, as measured on an Intel Core i7 processor.

The increasing number of available transistors on a chip 
(Moore’s Law) has, over the years, led to increasingly 
sophisticated processor structures, such as superscalar 
and out-of-order execution units, multi-level caches, and 
branch predictors. These processor structures have, in 
turn, led to a large increase in the performance poten- 
tial of software, but at the same time there is a widening 
gap between the performance of efficient software and the 
performance of inefficient software, primarily due to the 
increasing disparity of accessing different processor re- 
sources (e.g., registers vs. caches vs. memory). Server 
and system-intensive workloads, which are of particular 
interest in our work, are known to perform well below the
potential processor throughput [11, 12, 19]. Most studies 
attribute this inefficiency to the lack of locality. We claim 
that part of this lack of locality, and resulting performance 
degradation, stems from the current synchronous system 
call interface. 


Synchronous implementation of system calls negatively 
impacts the performance of system intensive workloads, 
both in terms of the direct costs of mode switching and, 
more interestingly, in terms of the indirect pollution of 
important processor structures which affects both user- 
mode and kernel-mode performance. A motivating ex- 
ample that quantifies the impact of system call pollution 
on application performance can be seen in Figure 1. It
depicts the user-mode instructions per cycle (kernel cycles
and instructions are ignored) of one of the SPEC CPU
2006 benchmarks (Xalan) immediately before and after a 
pwrite system call. There is a significant drop in in- 
structions per cycle (IPC) due to the system call, and it 
takes up to 14,000 cycles of execution before the IPC of 
this application returns to its previous level. As we will 
show, this performance degradation is mainly due to inter- 
ference caused by the kernel on key processor structures. 


To improve locality in the execution of system intensive 
workloads, we propose a new operating system mecha- 
nism: the exception-less system call. An exception-less 
system call is a mechanism for requesting kernel services 
that does not require the use of synchronous processor ex- 
ceptions. In our implementation, system calls are issued 
by writing kernel requests to a reserved syscall page, us- 
ing normal memory store operations. The actual execu- 
tion of system calls is performed asynchronously by spe- 
cial in-kernel syscall threads, which post the results of 
system calls to the syscall page after their completion. 


Decoupling the system call execution from its invoca- 
tion creates the possibility for flexible system call schedul- 
ing, offering optimizations along two dimensions. The 
first optimization allows for the deferred batch execution 
of system calls resulting in increased temporal locality of 
execution. The second provides the ability to execute sys- 
tem calls on a separate core, in parallel to executing user- 
mode threads, resulting in spatial, per core locality. In 
both cases, system call threads become a simple, but pow- 
erful abstraction. 


One interesting feature of the proposed decoupled sys- 
tem call model is the possibility of dynamic core special- 
ization in multicore systems. Cores can become temporar- 
ily specialized for either user-mode or kernel-mode execu- 
tion, depending on the current system load. We describe 
how the operating system kernel can dynamically adapt 
core specialization to the demands of the workload. 

One important challenge of our proposed system is how 
to best use the exception-less system call interface. One 
option is to rewrite applications to directly interface with 
the exception-less system call mechanism. We believe the
lessons learned by the systems community with event-driven
servers indicate that directly using exception-less
system calls would be a daunting software engineering
task. For this reason, we propose a new M-on-N
threading package (M user-mode threads executing on N
kernel-visible threads, with M >> N). The main purpose
of this threading package is to harvest independent system
calls by switching threads, in user-mode, whenever a
thread invokes a system call.
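
The idea of harvesting system calls by switching user-mode threads can be sketched in a few lines. The example below is only an illustration under our own assumptions, not the FlexSC-Threads implementation: Python generators stand in for user-mode threads, each `yield` stands in for a system call, and the names `worker`, `fake_kernel`, and `run` are hypothetical.

```python
from collections import deque


def worker(name, paths):
    """A user-mode thread; each `yield` stands in for one system call."""
    results = []
    for path in paths:
        # Instead of trapping into the kernel, hand the request to the scheduler.
        result = yield ("stat", path)
        results.append(result)
    return name, results


def fake_kernel(batch):
    """Hypothetical stand-in for the kernel: executes a whole batch at once."""
    return [f"{call}({arg}) ok" for call, arg in batch]


def run(threads):
    """Switch between user-mode threads and batch the calls they issue."""
    ready = deque((t, None) for t in threads)  # (thread, value to send in)
    finished, batches = [], []
    while ready:
        pending = []  # (thread, request) pairs waiting for the kernel
        while ready:
            thread, value = ready.popleft()
            try:
                # Run the thread until it issues its next "system call".
                pending.append((thread, thread.send(value)))
            except StopIteration as done:
                finished.append(done.value)
        if pending:
            batch = [request for _, request in pending]
            batches.append(batch)
            results = fake_kernel(batch)
            ready.extend((t, r) for (t, _), r in zip(pending, results))
    return finished, batches
```

With two threads, the first round collects one call from each thread into a single batch, so the stand-in kernel is entered once per round rather than once per call.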
This research makes the following contributions: 


• We quantify, at fine granularity, the impact of synchronous
mode switches and system call execution on
the micro-architectural processor structures, as well as
on the overall performance of user-mode execution.


• We propose a new operating system mechanism, the
exception-less system call, and describe an implementation,
FlexSC¹, in the Linux kernel.


• We present an M-on-N threading system, compatible
with PThreads, that transparently uses the new
exception-less system call facility.


• We show how exception-less system calls coupled with
our M-on-N threading system improve performance
of important system-intensive highly threaded workloads:
Apache by up to 116%, MySQL by up to 40%, and
BIND by up to 105%.


2 The (Real) Costs of System Calls 


In this section, we analyze the performance costs associ- 
ated with a traditional, synchronous system call. We ana- 
lyze these costs in terms of mode switch time, the system 
call footprint, and the effect on user-mode and kernel- 
mode IPC. We used the Linux operating system kernel 
and an Intel Nehalem (Core i7) processor, along with its 
performance counters to obtain our measurements. However,
we believe the lessons learned are applicable to most
modern high-performance processors² and other operating
system kernels.


2.1 Mode Switch Cost 


Traditionally, the performance cost attributed to system
calls is the mode switch time. The mode switch time consists
of the time necessary to execute the appropriate system
call instruction in user-mode, resume execution in
an elevated protection domain (kernel-mode), and return
control back to user-mode. Modern processors implement
the mode switch as a processor exception: flushing
the user-mode pipeline, saving a few registers onto the
kernel stack, changing the protection domain, and redirecting
execution to the registered exception handler. Subsequently,
a return from exception is necessary to resume
execution in user-mode.


¹ Pronounced as “flex” (/’fleks/).
² Experiments performed on an older PowerPC 970 processor yielded
insights similar to the ones presented here.


Syscall          | Instructions | Cycles | IPC  | i-cache | d-cache | L2   | L3   | d-TLB
stat             |         4972 |  13585 | 0.37 |      32 |     186 |  660 | 2559 |    21
pread            |         3739 |  12300 | 0.30 |      32 |     294 |  679 | 2160 |    20
pwrite           |         5689 |  31285 | 0.18 |      50 |     373 |  985 | 3160 |    44
open+close       |         6631 |  19162 | 0.34 |      47 |     240 |  900 | 3534 |    28
mmap+munmap      |         8977 |  19079 | 0.47 |      41 |     233 |  869 | 3913 |     7
open+write+close |         9921 |  32815 | 0.30 |      78 |     481 | 1462 | 5105 |    49

Table 1: System call footprint of different processor structures. For the processor structures (caches and TLB), the numbers represent
the number of entries evicted; the cache line size of the processor is 64 bytes. i-cache and d-cache refer to the instruction and data sections
of the L1 cache, respectively. The d-TLB represents the data portion of the TLB.

We measured the mode switch time by implementing
a new system call, gettsc, that obtains the time
stamp counter of the processor and immediately returns
to user-mode. We created a simple benchmark that invoked
gettsc 1 billion times, recording the time stamp
before and after each call. Together with the value returned
by gettsc, this yields three time stamps per call. The
differences between consecutive time stamps identify the
number of cycles necessary to enter and leave the operating
system kernel, namely 79 cycles and 71 cycles, respectively.
The total round-trip time for the gettsc system call is modest at
150 cycles, being less than the latency of a memory access
that misses the processor caches (250 cycles on our
machine).³


2.2 System Call Footprint 


The mode switch time, however, is only part of the cost of 
a system call. During kernel-mode execution, processor 
structures including the L1 data and instruction caches, 
translation look-aside buffers (TLB), branch prediction ta- 
bles, prefetch buffers, as well as larger unified caches (L2 
and L3), are populated with kernel specific state. The re- 
placement of user-mode processor state by kernel-mode 
processor state is referred to as the processor state pollu- 
tion caused by a system call. 

To quantify the pollution caused by system calls, we 
used the Core i7 hardware performance counters (HPC). 
We ran a high instruction per cycle (IPC) workload, 
Xalan, from the SPEC CPU 2006 benchmark suite that 
is known to invoke few system calls. We configured an 
HPC to trigger infrequently (once every 10 million user- 
mode instructions) so that the processor structures would 
be dominated with application state. We then set up the 
HPC exception handler to execute specific system calls, 
while measuring the replacement of application state in 
the processor structures caused by kernel execution (but 
not by the performance counter exception handler itself). 


³ For all experiments presented in this paper, user-mode applications
execute in 64-bit mode and when using synchronous system calls, use 
the “syscall” x86_64 instruction, which is currently the default in Linux. 


Table 1 shows the footprint on several processor structures
for three different system calls and three system call
combinations. The data shows that, even though the number
of i-cache lines replaced is modest (between 2 and
5 KB), the number of d-cache lines replaced is significant.
Given that the size of the d-cache on this processor
is 32 KB, we see that the system calls listed pollute at
least half of the d-cache, and almost all of the d-cache in
the “open+write+close” case. The 64-entry first-level d-TLB
is also significantly polluted by most system calls.
Finally, it is interesting to note that the system call impact
on the L2 and L3 caches is larger than on the L1 caches,
primarily because the L2 and L3 caches use more aggressive
prefetching.
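
To make the footprint numbers concrete, the following sketch converts the evicted-entry counts of Table 1 into kilobytes and into fractions of the L1 d-cache and first-level d-TLB. It is a simple back-of-the-envelope calculation: the line size, d-cache size, and d-TLB size are the ones stated in the text, while the names `FOOTPRINT` and `summarize` are introduced here for illustration only.

```python
LINE_BYTES = 64        # cache line size, from the Table 1 caption
L1D_BYTES = 32 * 1024  # L1 d-cache size, from the text
DTLB_ENTRIES = 64      # first-level d-TLB entries, from the text

# (i-cache lines, d-cache lines, d-TLB entries) evicted, copied from Table 1.
FOOTPRINT = {
    "stat": (32, 186, 21),
    "pread": (32, 294, 20),
    "pwrite": (50, 373, 44),
    "open+close": (47, 240, 28),
    "mmap+munmap": (41, 233, 7),
    "open+write+close": (78, 481, 49),
}


def summarize(name):
    """Express one row of Table 1 in KB and as fractions of each structure."""
    icache, dcache, dtlb = FOOTPRINT[name]
    return {
        "icache_kb": icache * LINE_BYTES / 1024,
        "dcache_kb": dcache * LINE_BYTES / 1024,
        "dcache_fraction": dcache * LINE_BYTES / L1D_BYTES,
        "dtlb_fraction": dtlb / DTLB_ENTRIES,
    }


for name in FOOTPRINT:
    s = summarize(name)
    print(f"{name:18s} i-cache {s['icache_kb']:.1f} KB  "
          f"d-cache {s['dcache_kb']:.1f} KB ({s['dcache_fraction']:.0%})  "
          f"d-TLB {s['dtlb_fraction']:.0%}")
```

Under this simple count, the i-cache footprint spans 2.0 to about 4.9 KB, and the d-cache share ranges from about 36% for stat to about 94% for open+write+close.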


2.3 System Call Impact on User IPC 


Ultimately, the most important measure of the real cost 
of system calls is the performance impact on the applica- 
tion. To quantify this, we executed an experiment similar 
to the one described in the previous subsection. However, 
instead of measuring kernel-mode events, we only mea- 
sured user-mode instructions per cycle (IPC), ignoring all 
kernel execution. Ideally, user-mode IPC should not de- 
crease as a result of invoking system calls, since the cy- 
cles and instructions executed as part of the system call 
are ignored in our measurements. In practice, however, 
user-mode IPC is affected by two sources of overhead: 


Direct: The processor exception associated with the sys- 
tem call instruction that flushes the processor pipeline. 


Indirect: System call pollution on the processor struc- 
tures, as quantified in Table 1. 


Figures 2 and 3 show the degradation in user-mode IPC 
when running Xalan (from SPEC CPU 2006) and SPEC- 
JBB, respectively, given different frequencies of pwrite 
calls. These benchmarks were chosen since they have 
been created to avoid significant use of system services, 
and should spend only 1-2% of time executing in kernel- 
mode. The graphs show that different workloads can have 
different sensitivities to system call pollution. Xalan has 
a baseline user-mode IPC of 1.46, but the IPC degrades 
by up to 65% when executing a pwrite every 1,000- 
2,000 instructions, yielding an IPC between 0.58 and 0.50. 
SPEC-JBB has a slightly lower baseline of 0.97, but still
observes a 45% degradation of user-mode IPC.


Figure 2: System call (pwrite) impact on user-mode IPC as a
function of system call frequency for Xalan.

Figure 3: System call (pwrite) impact on user-mode IPC as a
function of system call frequency for SPEC JBB.
[Plot: degradation (lower is faster), split into direct and indirect
components, versus instructions between interrupts (log scale, 1K to 500K).]

The figures also depict the breakdown of user-mode 
IPC degradation due to direct and indirect costs. The 
degradation due to the direct cost was measured by issuing
a null system call, while the indirect portion is calculated
by subtracting the direct cost from the degradation
measured when issuing a pwrite system call. For high 
frequency system call invocation (once every 2,000 in- 
structions, or less), the direct cost of raising an exception 
and subsequent flushing of the processor pipeline is the 
largest source of user-mode IPC degradation. However, 
for medium frequencies of system call invocation (once 
per 2,000 to 100,000 instructions), the indirect cost of sys- 
tem calls is the dominant source of user-mode IPC degra- 
dation. 
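
The breakdown described above is a simple subtraction, which the sketch below spells out. The Xalan baseline (1.46) and the degraded IPC values (0.50 and 0.58) come from the text; the IPC under a null system call is not reported here, so the value 1.00 used in the usage note is purely hypothetical, as are the function names.

```python
def degradation(baseline_ipc, measured_ipc):
    """Fractional loss of user-mode IPC relative to the baseline."""
    return 1.0 - measured_ipc / baseline_ipc


def split_costs(baseline_ipc, ipc_null_syscall, ipc_pwrite):
    """Split total degradation into direct and indirect parts.

    Direct cost: degradation observed with a null system call.
    Indirect cost: whatever remains of the pwrite degradation.
    """
    total = degradation(baseline_ipc, ipc_pwrite)
    direct = degradation(baseline_ipc, ipc_null_syscall)
    return {"total": total, "direct": direct, "indirect": total - direct}
```

For example, `degradation(1.46, 0.50)` is about 0.66 and `degradation(1.46, 0.58)` is about 0.60, in line with the degradation reported for Xalan. With the hypothetical null-call IPC, `split_costs(1.46, 1.00, 0.50)` attributes roughly 0.32 to the direct cost and 0.34 to the indirect cost.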

To understand the implication of these results on typical
server workloads, it is necessary to quantify the system
call frequency of these workloads. The average user-mode
instruction count between consecutive system calls
for three popular server workloads is shown in Table 2.
For this frequency range in Figures 2 and 3 we observe
user-mode IPC performance degradation between 20%
and 60%. While the execution of the server workloads
listed in Table 2 is not identical to that of Xalan or SPEC-JBB,
the data presented here indicates that server workloads
suffer from significant performance degradation due
to processor pollution of system calls.


Workload (server)     | Instructions per Syscall
DNSbench (BIND)       |                     2445
ApacheBench (Apache)  |                     3368
Sysbench (MySQL)      |                    12435

Table 2: The average number of instructions executed on different
workloads before issuing a syscall.


Figure 4: System call (pwrite) impact on kernel-mode IPC
as a function of system call frequency.


2.4 Mode Switching Cost on Kernel IPC 


The lack of locality due to frequent mode switches also 
negatively affects kernel-mode IPC. Figure 4 shows the 
impact of different system call frequencies on the kernel- 
mode IPC. As expected, the performance trend is opposite 
to that of user-mode execution. The more frequent the 
system calls, the more kernel state is maintained in the 
processor. 

Note that the kernel-mode IPC listed in Table 1 for different
system calls ranges from 0.18 to 0.47, with an average
of 0.32. This is significantly lower than the 1.47
and 0.97 user-mode IPC for Xalan and SPEC-JBB, re- 
spectively; up to 8x slower. 


3  Exception-Less System Calls 


To address (and partially eliminate) the performance im- 
pact of traditional, synchronous system calls on system 
intensive workloads, we propose a new operating system 
mechanism called the exception-less system call. An exception-less
system call is a mechanism for requesting kernel services
that does not require the use of synchronous processor
exceptions. The key benefit of exception-less system
calls is the flexibility in scheduling system call execution,
ultimately providing improved locality of execution
of both user and kernel code. We explore two use cases: 


System call batching: Delaying the execution of a series 
of system calls and executing them in batches minimizes 
the frequency of switching between user and kernel execu- 
tion, eliminating some of the mode switch overhead and 
allowing for improved temporal locality. This improves 
both the direct and indirect costs of system calls. 


Core specialization: In multicore systems, exception-less
system calls allow a system call to be scheduled on
a core different from the one on which the system call was
invoked. Scheduling system calls on a separate processor
core allows for improved spatial locality and with it lower
indirect costs. In an ideal scenario, no mode switches are
necessary, eliminating the direct cost of system calls.


Figure 5: Illustration of synchronous and exception-less system
call invocation: (a) traditional, synchronous system call;
(b) exception-less system call. The left diagram shows the sequential nature
of exception-based system calls, while the right diagram depicts
exception-less user and kernel communication through shared
memory.

Figure 6: 64-byte syscall entry from the syscall page.


The design of exception-less system calls consists of 
two components: (1) an exception-less interface for user- 
space threads to register system calls, along with (2) an 
in-kernel threading system that allows the delayed (asyn- 
chronous) execution of system calls, without interrupting 
or blocking the thread in user-space. 


3.1 Exception-Less Syscall Interface 


The interface for exception-less system calls is simply a 
set of memory pages that is shared amongst user and ker- 
nel space. The shared memory page, henceforth referred 
to as syscall page, is organized to contain exception-less 
system call entries. Each entry contains space for the re- 
quest status, system call number, arguments, and return 
value. 

With traditional synchronous system calls, invocation 
occurs by populating predefined registers with system call 
information and issuing a specific machine instruction that 
immediately raises an exception. In contrast, to issue an 
exception-less system call, the user-space thread must
find a free entry in the syscall page and populate the entry
with the appropriate values using regular store instructions.
The user-space thread can then continue executing
without interruption. It is the responsibility of the user- 
space thread to later verify the completion of the system 
call by reading the status information in the entry. None 
of these operations, issuing a system call or verifying its 
completion, causes exceptions to be raised. 


3.2 Syscall Pages 


Syscall pages can be viewed as a table of syscall entries,
each containing information specific to a single system
call request, including the system call number, arguments,
status (free/submitted/busy/done), and the result
(Figure 6). In our 64-bit implementation, we have organized
each entry to occupy 64 bytes. This size comes from
the Linux ABI, which allows any system call to have up to
6 arguments and a return value, totalling 56 bytes. Although
the remaining 3 fields (syscall number, status and
number of arguments) could be packed in less than the
remaining 8 bytes, we selected 64 bytes because 64 is a
divisor of popular cache line sizes of today’s processors.
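
One possible layout that satisfies this size budget is sketched below using Python's ctypes module. The text fixes only the totals (six 8-byte arguments plus an 8-byte return value, and three small fields within the remaining 8 bytes); the individual widths and the ordering of the three small fields, as well as the name `SyscallEntry`, are assumptions made for illustration.

```python
import ctypes


class SyscallEntry(ctypes.Structure):
    """Hypothetical 64-byte syscall entry; small-field widths are assumed."""
    _fields_ = [
        ("syscall_nr", ctypes.c_uint32),  # assumed 4 bytes
        ("num_args", ctypes.c_uint16),    # assumed 2 bytes
        ("status", ctypes.c_uint16),      # assumed 2 bytes
        ("args", ctypes.c_uint64 * 6),    # 6 arguments x 8 bytes = 48 bytes
        ("ret", ctypes.c_int64),          # return value, 8 bytes
    ]


ENTRY_BYTES = ctypes.sizeof(SyscallEntry)
ARGS_AND_RET_BYTES = ctypes.sizeof(ctypes.c_uint64 * 6) + ctypes.sizeof(ctypes.c_int64)
```

Here `ARGS_AND_RET_BYTES` is the 56 bytes mentioned in the text, and the three small fields fill the remaining 8 bytes so that `ENTRY_BYTES` is 64.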
To issue an exception-less system call, the user-space
thread must find an entry in one of its syscall pages that
contains a free status field. It then writes the syscall number
and arguments to the entry. Lastly, the status field is
changed to submitted⁴, indicating to the kernel that the request
is ready for execution. The thread must then check
the status of the entry until it becomes done, consume the
return value, and finally set the status of the entry to free.


3.3 Decoupling Execution from Invocation


Along with the exception-less interface, the operating sys- 
tem kernel must support delayed execution of system 
calls. Unlike exception-based system calls, the exception- 
less system call interface does not result in an explicit ker- 
nel notification, nor does it provide an execution stack. To 
support decoupled system call execution, we use a special
type of kernel thread, which we call a syscall thread.
Syscall threads always execute in kernel mode, and their 
sole purpose is to pull requests from syscall pages and ex- 
ecute them on behalf of the user-space thread. Figure 5 
illustrates the difference between traditional synchronous 
system calls, and our proposed split system call model. 

The combination of the exception-less system call interface
and independent syscall threads allows for great
flexibility in scheduling the execution of system calls.
Syscall threads may wake up only after user-space is unable
to make further progress, in order to achieve temporal
locality of execution on the processor. Orthogonally,
syscall threads can be scheduled on a different processor
core than that of the user-space thread, allowing for spatial
locality of execution. On modern multicore processors,
cache-to-cache communication is relatively fast (on
the order of tens of cycles), so communicating the entries
of syscall pages from a user-space core to a kernel core, or
vice versa, should only cause a small number of processor
stalls.


3.4 Implementation — FlexSC 


Our implementation of the exception-less system call
mechanism is called FlexSC (Flexible System Call) and
was prototyped as an extension to the Linux kernel. Although
our implementation was influenced by a monolithic
kernel architecture, we believe that most of our design
could be effective with other kernel architectures,
e.g., exception-less micro-kernel IPCs, and hypercalls in
a paravirtualized environment.


⁴ User-space must update the status field last, with an appropriate
memory barrier, to prevent the kernel from selecting incomplete syscall
entries to execute.

We have implemented FlexSC for the x86_64 and 
PowerPC64 processor architectures. Porting FlexSC to 
other architectures is trivial; a single function is needed, 
which moves arguments from the syscall page to appropri- 
ate registers, according to the ABI of the processor archi- 
tecture. Two new system calls were added to Linux as part 
of FlexSC, flexsc_register and flexsc_wait. 


flexsc_register() This system call is used by pro- 
cesses that wish to use the FlexSC facility. Making this 
registration procedure explicit is not strictly necessary, as 
processes can be registered with FlexSC upon creation. 
We chose to make it explicit mainly for convenience of 
prototyping, giving us more control and flexibility in user- 
space. One legitimate reason for making registration ex- 
plicit is to avoid the extra initialization overheads incurred 
for processes that do not use exception-less system calls. 


Invocation of the flexsc_register system call must 
use the traditional, exception-based system call interface 
to avoid complex bootstrapping; however, since this sys- 
tem call needs to execute only once, it does not impact 
application performance. Registration involves two steps: 
mapping one or more syscall pages into user-space virtual 
memory space, and spawning one syscall thread per entry 
in the syscall pages. 


flexsc_wait() The decoupled execution model of 
exception-less system calls creates a challenge in user- 
space execution, namely what to do when the user-space 
thread has nothing more to execute and is waiting on 
pending system calls. With the proposed execution model, 
the OS kernel loses the ability to determine when a user- 
space thread should be put to sleep. With synchronous 
system calls, this is simply achieved by putting the thread 
to sleep while it is executing a system call if the call blocks 
waiting for a resource. 


The solution we adopted is to require that the user ex- 
plicitly communicate to the kernel that it cannot progress 
until one of the issued system calls completes by invok- 
ing the flexsc_wait system call. We implemented 
flexsc_wait as an exception-based system call, since 
execution should be synchronously directed to the kernel. 
FlexSC will later wake up the user-space thread when at
least one of the posted system calls is complete.


3.5 Syscall Threads 


Syscall threads are the mechanism used by FlexSC to allow
for exception-less execution of system calls. The Linux
system call execution model has influenced some implementation
aspects of syscall threads in FlexSC: (1) the virtual
address space in which system call execution occurs
is the address space of the corresponding process, and (2)
the current thread context can be used to block execution
should a necessary resource not be available (for example,
waiting for I/O).

To resolve the virtual address space requirement, 
syscall threads are created during flexsc_register. 
Syscall threads are thus “cloned” from the registering pro- 
cess, resulting in threads that share the original virtual ad- 
dress space. This allows the transfer of data from/to user- 
space with no modification to Linux’s code. 

FlexSC would ideally never allow a syscall thread to 
sleep. If a resource is not currently available, notification 
of the resource becoming available should be arranged, 
and execution of the next pending system call should be- 
gin. However, implementing this behavior in Linux would 
require significant changes and a departure from the basic 
Linux architecture. Instead, we adopted a strategy that al- 
lows FlexSC to maintain the Linux thread blocking archi- 
tecture, as well as requiring only minor modifications (3 
lines of code) to Linux context switching code, by creat- 
ing multiple syscall threads for each process that registers 
with FlexSC. 

In fact, FlexSC spawns as many syscall threads as there 
are entries available in the syscall pages mapped in the 
process. This provisions for the worst case where ev- 
ery pending system call blocks during execution. Spawn- 
ing hundreds of syscall threads may seem expensive, but 
Linux in-kernel threads are typically much lighter weight 
than user threads: all that is needed is a task_struct 
and a small, 2-page, stack for execution. All the other 
structures (page table, file table, etc.) are shared with the 
user process. In total, only 10 KB of memory is needed
per syscall thread.
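
As a rough worked example of this provisioning cost, the sketch below multiplies the number of syscall entries by the per-thread memory figure. The 64-byte entry size and the 10 KB per thread come from the text; the 4 KB page size and the assumption that a syscall page contains nothing but entries are ours, and the function name is hypothetical.

```python
ENTRY_BYTES = 64   # size of one syscall entry, from the text
PAGE_BYTES = 4096  # assumed page size; not stated in the text
THREAD_KB = 10     # memory per syscall thread, from the text


def syscall_thread_memory(num_pages):
    """Return (syscall threads spawned, memory in KB) for num_pages pages."""
    # Assumes the whole page is filled with entries (no header).
    entries = num_pages * (PAGE_BYTES // ENTRY_BYTES)
    return entries, entries * THREAD_KB
```

Under these assumptions, one syscall page implies 64 syscall threads and 640 KB of memory, and four pages imply 256 threads and 2,560 KB.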

Despite spawning multiple threads, only one syscall
thread is active per application and core at any given point
in time. If system calls do not block, all the work is executed
by a single syscall thread, while the remaining ones
sleep on a work-queue. When a syscall thread needs to
block, for whatever reason, FlexSC notifies the work-queue
immediately before the thread is put to sleep. Another thread
wakes up and immediately starts executing the next system
call. Later, when resources become free, current
Linux code wakes up the waiting thread (in our case, a
syscall thread) and resumes its execution, so it can post its
result to the syscall page and return to wait in the FlexSC
work-queue.
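
The hand-off between syscall threads can be illustrated with a small deterministic simulation. This is not the kernel code: threads are plain integers, blocking is a flag supplied with each call, and the later wake-up of a blocked thread (when it posts its result) is omitted. The name `run_syscall_threads` is hypothetical.

```python
from collections import deque


def run_syscall_threads(calls, num_threads):
    """Simulate one core: at most one syscall thread is active at a time.

    calls: list of (name, blocks) pairs in submission order.
    Returns (log, blocked), where log lists (thread id, call name) pairs and
    blocked maps each sleeping thread to the call it is blocked on.
    """
    sleeping = deque(range(num_threads))  # threads waiting on the work-queue
    active = None
    blocked = {}
    log = []
    for name, blocks in calls:
        if active is None:
            # The work-queue wakes one sleeping thread to take over.
            active = sleeping.popleft()
        log.append((active, name))
        if blocks:
            # The active thread goes to sleep on its resource.
            blocked[active] = name
            active = None
    return log, blocked
```

If no call blocks, a single thread executes everything; if every call blocks, one thread per call is needed, which is the worst case that the one-thread-per-entry provisioning covers.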


3.6 FlexSC Syscall Thread Scheduler 


FlexSC implements a syscall thread scheduler that is re- 
sponsible for determining when and on which core sys- 
tem calls will execute. This scheduler is critical to per- 
formance, as it influences the locality of user and kernel 
execution. 
In a single-core environment, the FlexSC scheduler
assumes that user-space will attempt to post as many
exception-less system calls as possible, and subsequently
call flexsc_wait(). The FlexSC scheduler then
wakes up an available syscall thread that starts executing
the first system call. If the system call does not block,
the same syscall thread continues to execute the next submitted
syscall entry. If the execution of a syscall thread
blocks, the currently scheduled syscall thread notifies the
scheduler to wake another thread to continue to execute
more system calls. The scheduler does not wake up the
user-space thread until all available system calls have been
issued, and have either finished or are currently blocked,
with at least one system call having been completed. This
is done to minimize the number of mode switches to user-space.
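
The wake-up rule and its effect on the number of kernel entries can be written down compactly. The sketch below is our own reading of the rule stated above, expressed over a list of per-entry states; the state names and both function names are assumptions made for illustration.

```python
def should_wake_user(states):
    """Decide whether the user-space thread may be woken.

    states: one string per posted entry, each of "submitted", "busy",
    "blocked", or "done". The user-space thread is woken only when nothing
    is left to issue or run and at least one call has completed.
    """
    nothing_runnable = all(s in ("done", "blocked") for s in states)
    return nothing_runnable and "done" in states


def kernel_round_trips(num_calls, batch_size):
    """Round trips into the kernel when calls are issued in batches.

    batch_size=1 models synchronous system calls; larger values model one
    flexsc_wait() round trip per batch of posted calls.
    """
    return -(-num_calls // batch_size)  # ceiling division
```

For instance, 1,000 calls issued synchronously require 1,000 round trips, whereas posting them in batches of 32 requires 32.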

For multicore execution, the scheduler biases execution 
of syscall threads on a subset of available cores, dynam- 
ically specializing cores according to the workload re- 
quirements. In our current implementation, this is done 
by attempting to schedule syscall threads using a prede- 
termined, static list of cores. Upon a scheduling decision, 
the first core on the list is selected. If a syscall thread of 
a process is currently running on that core, the next core 
on the list is selected as the target. If the selected core is 
not currently executing a syscall thread, an inter-processor 
interrupt is sent to the remote core, signalling that it must 
wake a syscall thread. 
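
The static-list policy can be sketched as follows. This is one possible reading of the description above, not the FlexSC implementation: `pick_target_core` and its two set arguments are hypothetical, and the sketch distinguishes cores already running a syscall thread of this process from cores running any syscall thread at all.

```python
def pick_target_core(core_list, running_this_process, running_any_syscall_thread):
    """Walk the static core list and return (target_core, needs_ipi).

    core_list: predetermined, ordered list of core ids.
    running_this_process: cores where this process already has a running syscall thread.
    running_any_syscall_thread: cores currently executing any syscall thread.
    """
    for core in core_list:
        if core in running_this_process:
            continue  # a syscall thread of the process is already there; try the next core
        # A core that is not executing a syscall thread must be woken with an IPI.
        needs_ipi = core not in running_any_syscall_thread
        return core, needs_ipi
    return None, False  # every listed core already serves this process
```

For example, with the list `[3, 2, 1, 0]` and no syscall threads running, the sketch selects core 3 and reports that an inter-processor interrupt is required.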

As previously described, there is never more than one
syscall thread concurrently executing per core for a given
process. However, in the multicore case, a single process can
have as many syscall threads executing concurrently across the
system as there are cores. To avoid cache-line contention on
syscall pages amongst cores, a syscall thread locks a syscall
page before it begins executing calls from that page, and holds
the lock until all the calls submitted on the page have been
issued. Since FlexSC processes typically map multiple
syscall pages, each core on the system can schedule a
syscall thread to work independently, executing calls from
different syscall pages.


4 System Calls Galore — FlexSC-Threads 


Exception-less system calls present a significant change to
the semantics of the system call interface, with potentially
drastic implications for application code and programmers.
Programming using exception-less system calls directly
is more complex than using synchronous system
calls, as they do not provide the same easy-to-reason-about
sequentiality. In fact, our experience is that programming
using exception-less system calls is akin to
event-driven programming, which has itself been criticized
for being a complex programming model [21]. The
main difference is that with exception-less system calls,
not only are I/O related calls scheduled for future completion;
any system call can be requested, verified for completion,
and handled as if it were an asynchronous event.

To address the programming complexities, we propose
the use of exception-less system calls in two different
modes that might be used depending on the concurrency
model adopted by the programmer. We argue that if used
according to our recommendations, exception-less system
calls should pose no more complexity than their
synchronous counterparts.


4.1 Event-driven Servers, a Case for Hybrid 
Execution 


For event-driven systems, we advocate a hybrid approach 
where both synchronous and exception-less system calls 
coexist. System calls that are executed in performance 
critical paths of applications should use exception-less 
calls while all other calls should be synchronous. After 
all, there is no good justification to make a simple getpid() 
complex to program. 

Event-driven servers already have their code structured 
so that performance critical paths of execution are split 
into three parts: request event, wait for completion and 
handle event. Adapting an event-driven server to use 
exception-less system calls, for the already considered 
events, should be straightforward. However, we have not 
yet attempted to evaluate the use of exception-less system 
calls in an event-driven program, and leave this as future 
work. 


4.2 FlexSC-Threads 


Multiprocessing has become the default for computation
on servers. With the emergence and ubiquity of multicore
processors, along with projections of future chip
manufacturing technologies, it is unlikely that this trend will
reverse in the medium term. For this reason, and because
of its relative simplicity vis-a-vis event-based
programming, we believe that the multithreading
concurrency model will continue to be the norm.

In this section, we describe the design and implementa- 
tion of FlexSC-Threads, a threading package that trans- 
forms legacy synchronous system calls into exception- 
less ones transparently to applications. It is intended 
for server-type applications with many user-mode threads, 
such as Apache or MySQL. FlexSC-Threads is compli- 
ant with POSIX Threads, and binary compatible with 
NPTL [8], the default Linux thread library. As a re- 
sult, Linux multi-threaded programs work with FlexSC- 
Threads “out of the box” without modification or recom- 
pilation. 

Figure 7: The left-most diagram depicts the components of FlexSC-Threads pertaining to a single core. Each core executes a pinned
kernel-visible thread, which in turn can multiplex multiple user-mode threads. Multiple syscall pages, and consequently syscall
threads, are also allocated (and pinned) per core. The middle diagram depicts a user-mode thread being preempted as a result of
issuing a system call. The right-most diagram depicts the scenario where all user-mode threads are waiting for system call requests;
in this case the FlexSC-Threads library synchronously invokes flexsc_wait() in the kernel.

Figure 8: Multicore example. Opaque threads are active, while
grayed-out threads are inactive. Syscall pages are accessible to
both cores, as we run using shared-memory, leveraging the fast
on-chip communication of multicores.

FlexSC-Threads uses a simple M-on-N threading
model (M user-mode threads executing on N kernel-visible
threads). We rely on the ability to perform user-mode
thread switching solely in user-space to transparently
transform legacy synchronous calls into exception-less
ones. This is done as follows:


1. We redirect to our library each libc call that issues a 
legacy system call. Typically, applications do not di- 
rectly embed code to issue system calls, but instead 
call wrappers in the dynamically loaded libc. We use 
the dynamic loading capabilities of Linux to redirect 
execution of such calls to our library. 


2. FlexSC-Threads then posts the corresponding
exception-less system call to a syscall page and
switches to another user-mode thread that is ready.


3. If we run out of ready user-mode threads, FlexSC 
checks the syscall page for any syscall entries that 
have been completed, waking up the appropriate 
user-mode thread so it can obtain the result of the 
completed system call. 


4. As a last resort, flexsc_wait() is called, putting
the kernel-visible thread to sleep until one of the
pending system calls has completed.
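
The four steps above can be sketched as a toy user-mode scheduler. This is a simplified illustration, not the FlexSC-Threads library: user threads are modelled as Python generators that yield a callable standing in for a system call, and `fake_flexsc_wait`, `run_user_threads`, and `worker` are hypothetical names. The stand-in kernel only executes calls when the scheduler waits, so step 3 finds no completed entries until step 4 has run at least once.

```python
from collections import deque


def fake_flexsc_wait(page):
    """Stand-in for the kernel: execute every submitted entry as one batch."""
    for entry in page:
        if entry["status"] == "submitted":
            entry["result"] = entry["call"]()
            entry["status"] = "done"


def run_user_threads(threads):
    """Multiplex generator-based user threads; return how often we had to wait."""
    ready = deque((t, None) for t in threads)  # (thread, value to resume it with)
    page = []                                  # toy syscall page
    waits = 0
    while ready or page:
        if ready:
            thread, value = ready.popleft()
            try:
                call = thread.send(value)      # step 1: intercepted "system call"
            except StopIteration:
                continue                       # this user thread has finished
            # step 2: post the entry, then switch to the next ready thread
            page.append({"thread": thread, "call": call,
                         "status": "submitted", "result": None})
            continue
        # step 3: out of ready threads, so look for completed entries
        done = [e for e in page if e["status"] == "done"]
        if not done:
            fake_flexsc_wait(page)             # step 4: last resort, block in the kernel
            waits += 1
            done = [e for e in page if e["status"] == "done"]
        page = [e for e in page if e["status"] != "done"]
        for e in done:
            ready.append((e["thread"], e["result"]))
    return waits


def worker(name, n, log):
    """Example user thread: issues n calls and records each result."""
    for i in range(n):
        result = yield (lambda i=i: i * 2)
        log.append((name, result))
```

With two workers issuing two calls each, the four calls are served with only two waits in this model, which is the batching effect the steps are designed to produce.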


FlexSC-Threads implements multicore support by creating
a single kernel-visible thread per core available to
the process, and pinning each kernel-visible thread to a
specific core. Multiple user-mode threads multiplex execution
on the kernel-visible thread. Since kernel-visible
threads only block when there is no more available work,
there is no need to create more than one kernel-visible
thread per core. Figure 7 depicts the components of
FlexSC-Threads and how they interact during execution.

As an optimization, we have designed FlexSC-Threads
to register a private set of syscall pages per kernel-visible
thread (i.e., per core). Since syscall pages are private
to each core, there is no need to synchronize their
access with costly atomic instructions. The FlexSC-Threads
user-mode scheduler implements a simple form
of cooperative scheduling, with system calls acting as
yield points. Consequently, syscall pages behave as lock-free
single-producer (kernel-visible thread) and
single-consumer (syscall thread) data structures.
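
The single-producer, single-consumer discipline can be sketched with a per-entry status field that hands ownership back and forth. This is an illustrative model only: the status names, the `SyscallPage` class, and its methods are assumptions made for this sketch, and Python cannot express the memory-ordering guarantees a real shared-memory implementation would need.

```python
FREE, SUBMITTED, DONE = "free", "submitted", "done"


class SyscallPage:
    """Toy syscall page: each entry is owned by exactly one side at a time."""

    def __init__(self, entries=64):
        self.status = [FREE] * entries
        self.payload = [None] * entries

    # Producer side (kernel-visible thread running user-mode threads).
    def post(self, call):
        for i, state in enumerate(self.status):
            if state == FREE:
                self.payload[i] = call
                self.status[i] = SUBMITTED  # publish only after the payload is written
                return i
        return None                         # page is full

    def collect(self, i):
        if self.status[i] != DONE:
            return False, None
        result = self.payload[i]
        self.status[i] = FREE               # hand the entry back for reuse
        return True, result

    # Consumer side (syscall thread).
    def serve(self):
        served = 0
        for i, state in enumerate(self.status):
            if state == SUBMITTED:
                self.payload[i] = self.payload[i]()  # execute and store the result
                self.status[i] = DONE
                served += 1
        return served
```

Because the producer only touches free or done entries and the consumer only touches submitted ones, neither side needs a lock in this model.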

From the kernel side, although syscall threads are
pinned to specific cores, they are not restricted to executing
system call requests from syscall pages registered to that core. An
example of this is shown in Figure 8, where user-mode
threads execute on core 0, while syscall threads running
on core 1 are satisfying system call requests.

It is important to note that FlexSC-Threads relies on a
large number of independent user-mode threads to post
concurrent exception-less system calls. Since threads are
executing independently, there is no constraint on ordering
or serialization of system call execution (thread-safety
constraints should be enforced at the application level
and are orthogonal to the system call execution model).
FlexSC-Threads leverages the independent requests to
efficiently schedule operating system work on single or
multicore systems. For this reason, highly threaded workloads,
such as internet/network servers, are ideal candidates
for FlexSC-Threads.


5 Experimental Evaluation 


We first present the results of a microbenchmark that
shows the overhead of the basic exception-less system
call mechanism, and then we show the performance of
two popular server applications, Apache and MySQL,
transparently using exception-less system calls through
FlexSC-Threads. Finally, we analyze the sensitivity of
the performance of FlexSC to the number of system call
pages.

Component          | Specification
Cores              | 4+
Cache line         | 64 B for all caches
Private L1 i-cache | 32 KB, 3 cycle latency
Private L1 d-cache | 32 KB, 4 cycle latency
Private L2 cache   | 512 KB, 11 cycle latency
Shared L3 cache    | 8 MB, 35-40 cycle latency
Memory             | 250 cycle latency (avg.)
TLB (L1)           | 64 (data) + 64 (instr.) entries
TLB (L2)           | 512 entries

Table 3: Characteristics of the 2.3 GHz Core i7 processor.

Figure 9: Exception-less system call cost on a single-core.

FlexSC was implemented in the Linux kernel, version
2.6.33. The baseline measurements we present were
collected using unmodified Linux (same version), and the
default native POSIX threading library (NPTL). We identify
the baseline configuration as “sync”, and the system
with exception-less system calls as “flexsc”.

The experiments presented in this section were run on 
an Intel Nehalem (Core i7) processor with the character- 
istics shown in Table 3. The processor has 4 cores, each 
with 2 hyper-threads. We disabled the hyper-threads, as 
well as the “TurboBoost” feature, for all our experiments 
to more easily analyze the measurements obtained. 

For the Apache and MySQL experiments, requests
were generated by a remote client connected to our test
machine through a 1 Gbps network, using a dedicated
router. The client machine contained a dual core Core2
processor, running the same Linux installation as the test
machine, and was not CPU or network constrained in any
of the experiments.

All values reported in our evaluation represent the av- 
erage of 5 separate runs. 


5.1 Overhead 


The overhead of executing an exception-less system call
involves switching to a syscall thread, de-marshalling arguments
from the appropriate syscall page entry, switching
back to the user-thread, and retrieving the return value
from the syscall page entry. To measure this overhead,
we created a micro-benchmark that successively invokes a
getppid() system call. Since the user and kernel footprints
of this call are small, the time measured corresponds
to the direct cost of issuing system calls.

Figure 10: Exception-less system call cost, in the worst case, for
remote core execution.

We varied the number of batched system calls, in the 
exception-less case, to verify if the direct costs are amor- 
tized when batching an increasing number of calls. The 
results obtained executing on a single core are shown in 
Figure 9. The baseline time, shown as a horizontal line, is
the time to execute an exception-based system call on a 
single core. Executing a single exception-less system call 
on a single core is 43% slower than a synchronous call. 
However, when batching 2 or more calls there is no over- 
head, and when batching 32 or more calls, the execution 
of each call is up to 130% faster than a synchronous call. 
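
One way to read these numbers is through a simple amortization model in which each batch pays a fixed switching cost once plus a per-call cost. The sketch below is an assumption-laden illustration, not a measurement: the two-parameter model, the function names, and the reading of "130% faster" as a synchronous-to-batched time ratio of 2.3 are all introduced here.

```python
def fit_amortized_model(cost_at_1, cost_at_n, n):
    """Solve cost(k) = fixed / k + per_call from the points k = 1 and k = n."""
    fixed = (cost_at_1 - cost_at_n) * n / (n - 1)
    per_call = cost_at_1 - fixed
    return fixed, per_call


def cost(k, fixed, per_call):
    """Per-call cost for a batch of k calls, relative to one synchronous call."""
    return fixed / k + per_call


# Relative to a synchronous call costing 1.0: a batch of one is 43% slower,
# and a batch of 32 is assumed to take 1 / 2.3 of the synchronous time.
fixed, per_call = fit_amortized_model(1.43, 1 / 2.3, 32)
```

Under these assumptions the model predicts a per-call cost below 1.0 for a batch of two, which is consistent with the observation that the overhead disappears once two or more calls are batched.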

We also measured the time to execute system calls on 
a remote core (Figure 10). In addition to the single core 
operations, remote core execution entails sending an inter- 
processor interrupt (IPI) to wake up the remote syscall 
thread. In the remote core case, the time to issue a sin- 
gle exception-less system call can be more than 10 times 
slower than a synchronous system call on the same core. 
This measurement represents a worst case scenario, in which
there is no currently executing syscall thread. Despite this
high cost, the overhead of remote core execution is
recouped when batching 32 or more system calls.


5.2 Apache 


We used Apache version 2.2.15 to evaluate the perfor- 
mance of FlexSC-Threads. Since FlexSC-Threads is bi- 
nary compatible with NPTL, we used the same Apache 
binary for both FlexSC and Linux/NPTL experiments. 
We configured Apache to use a different maximum num- 
ber of spawned threads for each case. The performance 
of Apache running on NPTL degrades with too many 
threads, and we experimentally determined that 200 was 
optimal for our workload and hence used that configura- 
tion for the NPTL case. For the FlexSC-Threads case, we 
raised the maximum number of threads to 1000. 

The workload we used was ApacheBench, an HTTP
workload generator that is distributed with Apache. It
is designed to stress-test the Web server, determining the
number of requests per second that can be serviced with
a varying number of concurrent requests.

Figure 11: Comparison of Apache throughput of Linux/NPTL and FlexSC executing on 1, 2 and 4 cores.

Figure 12: Breakdown of execution time of Apache and MySQL
workloads on 4 cores.

Figure 13: Comparison of Apache latency of Linux/NPTL and
FlexSC executing on 1, 2 and 4 cores, with 256 concurrent requests.

Figure 11 shows the results of Apache running on 1, 2
and 4 cores. For the single core experiments, FlexSC em- 
ploys system call batching, and for the multicore experi- 
ments it additionally dynamically redirects system calls to 
maximize core locality. The results show that, except for 
a very low number of concurrent requests, FlexSC outper- 
forms Linux/NPTL by a wide margin. With system call 
batching alone (1 core case), we observe a throughput im- 
provement of up to 86%. The 2 and 4 core experiments 
show that FlexSC achieves up to 116% throughput im- 
provement, showing the added benefit of dynamic core 
specialization. 

Table 4 shows the effects of FlexSC on the microarchitectural
state of the processor while running Apache. It
displays various processor metrics, collected using hardware
performance counters during execution with 512
concurrent requests. The most important metric listed
is the instructions per cycle (IPC) of the user and kernel
mode for the different setups, as it summarizes the
efficiency of execution. The other values listed are normalized
using misses per kilo-instruction (MPKI).
MPKI is a widely used normalization method that makes
it easy to compare values obtained from different executions.

The most efficient execution of the four listed in the
table is FlexSC on 1 core, yielding an IPC of 0.94 on both
kernel and user execution, which is 95-108% higher than
for NPTL. While the FlexSC execution of Apache on 4
cores is not as efficient as the single core case, with an
average IPC of 0.75, there is still a 71% improvement,
on average, over NPTL.
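
The improvement figures quoted above can be re-derived from the IPC columns of Table 4. The short check below is a sketch: the helper names are hypothetical, and averaging the user and kernel improvements with equal weight is an assumption, since the text does not state how the 4 core average was computed.

```python
# IPC values copied from Table 4 (Apache).
ipc = {
    "sync (1 core)": {"user": 0.48, "kernel": 0.45},
    "flexsc (1 core)": {"user": 0.94, "kernel": 0.94},
    "sync (4 cores)": {"user": 0.45, "kernel": 0.43},
    "flexsc (4 cores)": {"user": 0.74, "kernel": 0.76},
}


def improvement_pct(new, old):
    """Relative improvement of new over old, in percent."""
    return (new / old - 1.0) * 100.0


def mode_improvements(flexsc_setup, sync_setup):
    """Per-mode IPC improvement of a flexsc setup over its sync baseline."""
    return {
        mode: improvement_pct(ipc[flexsc_setup][mode], ipc[sync_setup][mode])
        for mode in ("user", "kernel")
    }


one_core = mode_improvements("flexsc (1 core)", "sync (1 core)")
four_core = mode_improvements("flexsc (4 cores)", "sync (4 cores)")
four_core_avg = (four_core["user"] + four_core["kernel"]) / 2
```

With these inputs the single core improvements come out at roughly 95.8% (user) and 108.9% (kernel), and the equally weighted 4 core average at roughly 70.6%, in line with the 95-108% and 71% reported in the text.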

Most metrics we collected are significantly improved 
with FlexSC. Of particular importance are the perfor- 
mance critical structures that have a high MPKI value 
on NPTL such as d-cache, i-cache, and L2 cache. The 
better use of these microarchitectural structures effec- 
tively demonstrates the premise of this work, namely that 
exception-less system calls can improve processor effi- 
ciency. The only structure which observes more misses 
on FlexSC is the user-mode TLB. We are currently inves- 
tigating the reason for this. 

There is an interesting disparity between the throughput
improvement (94%) and the IPC improvement (71%)
in the 4 core case. The difference comes from the added
benefit of localizing kernel execution with core specialization.
Figure 12a shows the time breakdown of Apache executing
on 4 cores. FlexSC execution yields significantly
less idle time than the NPTL execution.⁵ The reduced
idle time is a consequence of lowering the contention
on a specific kernel semaphore. Linux protects address
spaces with a per address-space read-write semaphore
(mmap_sem). Profiling shows that every Apache thread
allocates and frees memory for serving requests, and both
of these operations require the semaphore to be held with
write permission. Further, the network code in Linux invokes
copy_user(), which transfers data in and out
of the user address-space. This function verifies that the
user-space memory is indeed valid, and to do so acquires
the semaphore with read permissions. In the NPTL case,
threads from all 4 cores compete on this semaphore, resulting
in 50% idle time. With FlexSC, kernel code is
dynamically scheduled to run predominantly on 2 out of
the 4 cores, halving the contention on this resource and
eliminating 38% of the original idle time.

⁵ The execution of Apache on 1 or 2 cores did not present idle time.

Apache           | User                                                 | Kernel
Setup            | IPC  | L3  | L2   | d-cache | i-cache | TLB  | Branch | IPC  | L3  | L2   | d-cache | i-cache | TLB | Branch
sync (1 core)    | 0.48 | 3.7 | 68.9 | 63.8    | 130.8   | 7.7  | 20.9   | 0.45 | 1.4 | 80.0 | 78.2    | 159.6   | 4.6 | 15.7
flexsc (1 core)  | 0.94 | 1.7 | 27.5 | 35.3    | 41.3    | 8.8  | 12.6   | 0.94 | 1.0 | 15.8 | 31.6    | 45.2    | 3.3 | 11.2
sync (4 cores)   | 0.45 | 3.9 | 64.6 | 67.9    | 127.6   | 9.6  | 20.2   | 0.43 | 4.4 | 49.5 | 73.8    | 124.9   | 4.4 | 15.2
flexsc (4 cores) | 0.74 | 1.0 | 37.5 | 55.5    | 49.4    | 19.3 | 13.0   | 0.76 | 1.5 | 19.1 | 50.2    | 63.7    | 4.2 | 11.6

Table 4: Micro-architectural breakdown of Apache execution on uni- and quad-core setups. All values shown, except for IPC, are
normalized using misses per kilo-instruction (MPKI): therefore, lower numbers yield more efficient execution and higher IPC.

Another important metric for servicing Web requests 
besides throughput is latency of individual requests. One 
might intuitively expect the latency of requests to be
higher under FlexSC because of batching and asyn- 
chronous servicing of system calls, but the opposite is the 
case. Figure 13 shows the average latency of requests 
when processing 256 concurrent requests (other concur- 
rency levels showed similar trends). The results show that 
Web requests on FlexSC are serviced within 50-60% of 
the time needed on NPTL, on average. 


5.3 MySQL 


In the previous section, we demonstrated the effectiveness 
of FlexSC running on a workload with a significant pro- 
portion of kernel time. In this section, we experiment with 
OLTP on MySQL, a workload for which the proportion of 
kernel execution is smaller (roughly 25%). Our evaluation 
used MySQL version 5.5.4 with an InnoDB backend en- 
gine, and as in the Apache evaluation, we used the same 
binary for running on NPTL and on FlexSC. We also used 
the same configuration parameters for both the NPTL and 
FlexSC experiments, after tuning them for the best NPTL 
performance. 

To generate requests to MySQL, we used the sysbench 
system benchmark utility. Sysbench was created for 
benchmarking MySQL processor performance and con- 
tains an OLTP inspired workload generator. The bench- 
mark allows executing concurrent requests by spawning 
multiple client threads, connecting to the server, and se- 
quentially issuing SQL queries. To handle the concurrent 
clients, MySQL spawns a user-level thread per connec- 
tion. At the end, sysbench reports the number of trans- 
actions per second executed by the database, as well as 
average latency information. For these experiments, we 
used a database with 5M rows, resulting in 1.2 GB of data. 
Since we were interested in stressing the CPU component 
of MySQL, we disabled synchronous transactions to disk. 
Given that the configured database was small enough to
fit in memory, the workload presented no idle time due to
disk I/O.

Figure 14 shows the throughput numbers obtained on
1, 2 and 4 cores when varying the number of concurrent
client threads issuing requests to the MySQL server.⁶
For this workload, system call batching on one core provides
modest improvements: up to 14% with 256 concurrent requests.
On 2 and 4 cores, however, we see that FlexSC
provides a consistent improvement with 16 or more concurrent
clients, achieving up to 37-40% higher throughput.

Table 5 contains the microarchitectural processor met- 
rics collected for the execution of MySQL. Because 
MySQL invokes the kernel less frequently than Apache, 
kernel execution yields high miss rates, resulting in a low 
IPC of 0.33 on NPTL. In the single core case, FlexSC does 
not greatly alter the execution of user-space, but increases 
kernel IPC by 36%. FlexSC allows the kernel to reuse 
state in the processor structures, yielding lower misses 
across most metrics. In the case of 4 cores, FlexSC also 
improves the performance of user-space IPC by as much 
as 30%, compared to NPTL. Despite making less of an 
impact in the kernel IPC than in single core execution, 
there is still a 25% kernel IPC improvement over NPTL. 

Figure 15 shows the average latencies of individual re- 
quests for MySQL execution with 256 concurrent clients. 
As is the case with Apache, the latency of requests on 
FlexSC is improved over execution on NPTL. Requests 
on FlexSC are satisfied within 70-88% of the time used 
by requests on NPTL. 


⁶ For both NPTL and FlexSC, increasing the load on MySQL yields
peak throughput between 32 and 128 concurrent clients, after which
throughput degrades. The main reason for performance degradation is
the costly and coarse synchronization used in MySQL. MySQL and
Linux kernel developers have observed similar performance degradation.

Figure 14: Comparison of MySQL throughput of Linux/NPTL and FlexSC executing on 1, 2 and 4 cores.

MySQL            | User                                                | Kernel
Setup            | IPC  | L3  | L2   | d-cache | i-cache | TLB | Branch | IPC  | L3   | L2    | d-cache | i-cache | TLB | Branch
sync (1 core)    | 1.12 | 0.6 | 21.1 | 34.8    | 24.2    | 3.8 | 7.8    | 0.33 | 16.5 | 125.2 | 209.6   | 184.9   | 3.9 | 17.4
flexsc (1 core)  | 1.10 | 0.8 | 19.6 | 36.3    | 23.6    | 5.4 | 6.9    | 0.45 | 23.2 | 55.1  | 131.9   | 86.5    | 3.7 | 13.6
sync (4 cores)   | 0.55 | 3.7 | 15.8 | 2.2     | 18.9    | 3.1 | 5.9    | 0.36 | 16.6 | 78.0  | 147.0   | 120.0   | 3.6 | 15.7
flexsc (4 cores) | 0.72 | 2.7 | 16.7 | 30.6    | 20.9    | 4.1 | 6.5    | 0.45 | 18.4 | 46.6  | 104.4   | 63.5    | 2.5 | 11.5

Table 5: Micro-architectural breakdown of MySQL execution on uni- and quad-core setups. All values shown, except for IPC, are
normalized using misses per kilo-instruction (MPKI): therefore, lower numbers yield more efficient execution and higher IPC.

Figure 15: Comparison of MySQL latency of Linux/NPTL and
FlexSC executing on 1, 2 and 4 cores, with 256 concurrent requests.


5.4 Sensitivity Analysis


In all experiments presented so far, FlexSC was configured
to have 8 system call pages per core, allowing up to
512 concurrent exception-less system calls per core.

Figure 16 shows the sensitivity of FlexSC to the number
of available syscall entries. It depicts the throughput
of Apache, on 1 and 4 cores, while servicing 2048 concurrent
requests per core, so that there would always be more
requests available than syscall entries. Uni-core performance
approaches its best with 200 to 250 syscall entries
(3 to 4 syscall pages), while quad-core execution starts
to plateau with 300 to 400 syscall entries (6 to 7 syscall
pages).

Figure 16: Execution of Apache on FlexSC-Threads, showing
the performance sensitivity of FlexSC to different numbers of
syscall pages. Each syscall page contains 64 syscall entries.

It is particularly interesting to compare Figure 16 with
Figures 9 and 10. The direct cost of mode switching,
exemplified by the micro-benchmark, has a lesser effect on
performance than the indirect cost of mixing
user- and kernel-mode execution.


6 Related Work


6.1 System Call Batching 


The idea of batching calls in order to save crossings 
has been extensively explored in the systems community. 
Specific to operating systems, multi-calls are used in both 
operating systems and paravirtualized hypervisors as a 
mechanism to address the high overhead of mode switch- 
ing. Cassyopia is a compiler targeted at rewriting pro- 
grams to collect many independent system calls, and sub- 
mitting them as a single multi-call [18]. An interesting 
technique in Cassyopia, which could be eventually ex- 
plored in conjunction with FlexSC, is the concept of a 
looped multi-call where the result of one system call can 
be automatically fed as an argument to another system call 
in the same multi-call. In the context of hypervisors, both 
Xen and VMware currently support a special multi-call 
hypercall feature [4, 20].

An important difference between multi-calls and 
exception-less system calls is the level of flexibility ex- 
posed. The multi-call proposals do not investigate the 
possibility of parallel execution of system calls, or ad- 
dress the issue of blocking system calls. In multi-calls, 
system calls are executed sequentially; each system call 
must complete before a subsequent one can be issued. With
exception-less system calls, system calls can be executed 
in parallel, and in the presence of blocking, the next call 
can execute immediately. 
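
The difference can be made concrete with a toy timing model. This is a deliberately simplified sketch, not a model of any real system: each call is described by a hypothetical pair of CPU time and blocked time, a single core is assumed, and enough syscall threads are assumed to exist that blocked calls never hold up later ones.

```python
def multicall_time(calls):
    """Sequential multi-call: each call fully completes before the next starts."""
    return sum(cpu + blocked for cpu, blocked in calls)


def exceptionless_time(calls):
    """CPU portions are serialized on one core; blocked periods overlap."""
    clock = 0   # time at which the core becomes free
    finish = 0  # completion time of the slowest call so far
    for cpu, blocked in calls:
        clock += cpu                          # run this call's CPU portion
        finish = max(finish, clock + blocked) # it completes after its blocked period
    return finish
```

For a batch of three calls where only the first blocks, for example `[(1, 10), (1, 0), (1, 0)]`, the sequential model takes 13 time units while the overlapped model takes 11; with no blocking calls the two models agree.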


6.2 Locality of Execution and Multicores 


Several researchers have studied the effects of operating
system execution on application performance [1, 3, 7, 6,
11, 13]. Larus and Parkes also identified processor inefficiencies
of server workloads, although not focusing on
the interaction with the operating system. They proposed
Cohort Scheduling to efficiently execute staged computations
to improve locality of execution [11].

Techniques such as Soft Timers [3] and Lazy Receiver 
Processing [9] also address the issue of locality of execu- 
tion, from the other side of the compute stack: handling 
device interrupts. Both techniques describe how to limit 
processor interference associated with interrupt handling, 
while not impacting the latency of servicing requests. 

Most similar to the multicore execution of FlexSC
is Computation Spreading, proposed by Chakraborty et
al. [6]. They introduced processor modifications to allow
for hardware migration of threads, and evaluated the
effects of migrating threads upon entering the kernel to
specialize cores. Their simulation-based results show an
improvement of up to 20% on Apache; however, they
explicitly do not model TLBs and provide for fast thread
migration between cores. On current hardware, synchronous
thread migration between cores requires a costly
inter-processor interrupt.

Recently, both Corey and Factored Operating System 
(fos) have proposed dedicating cores for specific operating 
system functionality [24, 25]. There are two main differ- 
ences between the core specialization possible with these 
proposals and FlexSC. First, both Corey and fos require 
a micro-kernel design of the operating system kernel in 
order to execute specific kernel functionality on dedicated 
cores. Second, FlexSC can dynamically adapt the propor- 
tion of cores used by the kernel, or cores shared by user 
and kernel execution, depending on the current workload 
behavior. 

Explicit off-loading of select OS functionality to cores 
has also been studied for performance [15, 16] and power 
reduction in the presence of single-ISA heterogeneous 
multicores [14]. While these proposals rely on expen- 
sive inter-processor interrupts to offload system calls, we 
hope FlexSC can provide for a more efficient, and flexible, 
mechanism that can be used by such proposals. 


6.3 Non-blocking Execution


Past research on improving system call performance has
focused extensively on blocking versus non-blocking behavior.
Typically researchers have analyzed the use of
threading, event-based (non-blocking), and hybrid systems
for achieving high performance on server applications
[2, 10, 17, 21, 22, 23]. Capriccio described techniques
to improve performance of user-level thread libraries
for server applications [22]. Specifically, von Behren
et al. showed how to efficiently manage thread stacks,
minimizing wasted space, and proposed resource aware
scheduling to improve server performance. For an
extensive performance comparison of thread-based and
event-driven Web server architectures we refer the reader
to [17].

Finally, the Linux community has proposed a generic
mechanism for implementing non-blocking system calls,
which is called asynchronous system calls [5]. In their proposal,
system calls are still exception-based, and tentatively
execute synchronously. Like scheduler activations,
if a blocking condition is detected, they utilize a “syslet”
thread to block, allowing the user thread to continue execution.

The main difference between many of the proposals for 
non-blocking execution and FlexSC is that none of the 
non-blocking system call proposals completely decouple 
the invocation of the system call from its execution. As 
we have discussed, the flexibility resulting from this de- 
coupling is crucial for efficiently exploring optimizations 
such as system call batching and core specialization. 


7 Concluding Remarks 


In this paper, we introduced the concept of exception-less 
system calls that decouples system call invocation from 
execution. This allows for flexible scheduling of system 
call execution which in turn enables system call batching 
and dynamic core specialization that both improve local- 
ity in a significant way. System calls are issued by writ- 
ing kernel requests to a reserved syscall page using nor- 
mal store operations, and they are executed by special in- 
kernel syscall threads, which then post the results to the 
syscall page. 

In fact, the concept of exception-less system calls origi- 
nated as a mechanism for low-latency communication be- 
tween user and kernel-space with hyper-threaded proces- 
sors in mind. We had hoped that communicating directly 
through the shared L1 cache would be much more effi- 
cient than mode switching. However, the measurements 
presented in Section 2 made it clear that mixing user and 
kernel-mode execution on the same core would not be effi- 
cient for server class workloads. In future work we intend 
to study how to exploit exception-less system calls as a 
communication mechanism in hyper-threaded processors. 

We presented our implementation of FlexSC, a Linux 
kernel extension, and FlexSC-Threads, an M-on-N threading
package that is binary compatible with NPTL and
that transparently transforms synchronous system calls 
into exception-less ones. With this implementation, 
we demonstrated how FlexSC improves throughput of 
Apache by up to 116% and MySQL by up to 40% while 
requiring no modifications to the applications. We be- 
lieve these two workloads are representative of other 
highly threaded server workloads that would benefit from 
FlexSC. For example, experiments with the BIND DNS 
server demonstrated throughput improvements of between 
30% and 105% depending on the concurrency of requests. 
In the current implementation of FlexSC, syscall
threads process system call requests in no specific or- 
der, opportunistically issuing calls as they are posted on 
syscall pages. The asynchronous execution model, how- 
ever, would allow for different selection algorithms. For 
example, syscall threads could sort the requests to con- 
secutively execute requests of the same type, potentially 
yielding greater locality of execution. Also, system calls 
that perform I/O could be prioritized so as to issue them 
as early as possible. Finally, if a large number of cores are 
available, cores could be dedicated to specific system call 
types to promote further locality gains. 
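
As a rough illustration of the selection algorithms suggested above, the following Python sketch orders pending requests so that I/O calls are issued first and calls of the same type run consecutively. The `Request` type, the `IO_SYSCALLS` set, and `order_requests` are hypothetical names chosen for this example; they are not part of FlexSC.

```python
from dataclasses import dataclass

# Hypothetical set of system calls treated as I/O for prioritization.
IO_SYSCALLS = {"read", "write", "sendfile"}


@dataclass
class Request:
    """A pending entry on a syscall page (illustrative, not FlexSC's layout)."""
    syscall: str
    args: tuple = ()


def order_requests(pending):
    """Return requests with I/O calls first, grouped by syscall type.

    sorted() is stable, so requests of the same type keep posting order.
    """
    return sorted(pending, key=lambda r: (r.syscall not in IO_SYSCALLS, r.syscall))


pending = [Request("stat"), Request("read"), Request("getpid"),
           Request("read"), Request("stat")]
ordered = [r.syscall for r in order_requests(pending)]
```

Whether such reordering pays off would depend on the workload; the sketch only shows that the asynchronous model leaves room for it.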
Abstract: This paper describes Haystack, an object storage
system optimized for Facebook’s Photos application.
Facebook currently stores over 260 billion images,
which translates to over 20 petabytes of data. Users up- 
load one billion new photos (~60 terabytes) each week 
and Facebook serves over one million images per sec- 
ond at peak. Haystack provides a less expensive and 
higher performing solution than our previous approach, 
which leveraged network attached storage appliances 
over NFS. Our key observation is that this traditional 
design incurs an excessive number of disk operations 
because of metadata lookups. We carefully reduce this 
per photo metadata so that Haystack storage machines 
can perform all metadata lookups in main memory. This 
choice conserves disk operations for reading actual data 
and thus increases overall throughput. 


1 Introduction


Sharing photos is one of Facebook’s most popular fea- 
tures. To date, users have uploaded over 65 billion pho- 
tos making Facebook the biggest photo sharing website 
in the world. For each uploaded photo, Facebook gen- 
erates and stores four images of different sizes, which 
translates to over 260 billion images and more than 20 
petabytes of data. Users upload one billion new photos 
(~60 terabytes) each week and Facebook serves over 
one million images per second at peak. As we expect 
these numbers to increase in the future, photo storage 
poses a significant challenge for Facebook’s infrastruc- 
ture. 

This paper presents the design and implementation 
of Haystack, Facebook’s photo storage system that has 
been in production for the past 24 months. Haystack is 
an object store [7, 10, 12, 13, 25, 26] that we designed 
for sharing photos on Facebook where data is written 
once, read often, never modified, and rarely deleted. We 
engineered our own storage system for photos because 
traditional filesystems perform poorly under our work- 
load. 

In our experience, we find that the disadvantages of 
a traditional POSIX [21] based filesystem are directories
and per file metadata. For the Photos application
most of this metadata, such as permissions, is unused
and thereby wastes storage capacity. Yet the more significant
cost is that the file’s metadata must be read from
disk into memory in order to find the file itself. While 
insignificant on a small scale, multiplied over billions 
of photos and petabytes of data, accessing metadata is 
the throughput bottleneck. We found this to be our key 
problem in using a network attached storage (NAS) ap- 
pliance mounted over NFS. Several disk operations were 
necessary to read a single photo: one (or typically more) 
to translate the filename to an inode number, another to 
read the inode from disk, and a final one to read the 
file itself. In short, using disk IOs for metadata was the 
limiting factor for our read throughput. Observe that in 
practice this problem introduces an additional cost as we 
have to rely on content delivery networks (CDNs), such 
as Akamai [2], to serve the majority of read traffic. 


Given the disadvantages of a traditional approach, 
we designed Haystack to achieve four main goals: 


High throughput and low latency. Our photo storage 
systems have to keep up with the requests users make. 
Requests that exceed our processing capacity are either 
ignored, which is unacceptable for user experience, or 
handled by a CDN, which is expensive and reaches a 
point of diminishing returns. Moreover, photos should 
be served quickly to facilitate a good user experience. 
Haystack achieves high throughput and low latency 
by requiring at most one disk operation per read. We 
accomplish this by keeping all metadata in main mem- 
ory, which we make practical by dramatically reducing 
the per photo metadata necessary to find a photo on disk. 


Fault-tolerant. In large scale systems, failures happen 
every day. Our users rely on their photos being available 
and should not experience errors despite the inevitable 
server crashes and hard drive failures. It may happen 
that an entire datacenter loses power or a cross-country 
link is severed. Haystack replicates each photo in 
geographically distinct locations. If we lose a machine 
we introduce another one to take its place, copying data 
for redundancy as necessary. 


Cost-effective. Haystack performs better and is less 
expensive than our previous NFS-based approach. We
quantify our savings along two dimensions: Haystack’s
cost per terabyte of usable storage and Haystack’s read
rate normalized for each terabyte of usable storage¹.
In Haystack, each usable terabyte costs ~28% less 
and processes ~4x more reads per second than an 
equivalent terabyte on a NAS appliance. 


Simple. In a production environment we cannot over- 
state the strength of a design that is straight-forward 
to implement and to maintain. As Haystack is a new 
system, lacking years of production-level testing, we 
paid particular attention to keeping it simple. That 
simplicity let us build and deploy a working system in a 
few months instead of a few years. 


This work describes our experience with Haystack 
from conception to implementation of a production 
quality system serving billions of images a day. Our 
three main contributions are: 


• Haystack, an object storage system optimized for
the efficient storage and retrieval of billions of photos.


• Lessons learned in building and scaling an inexpensive,
reliable, and available photo storage system.


• A characterization of the requests made to Facebook’s
photo sharing application.


We organize the remainder of this paper as fol- 
lows. Section 2 provides background and highlights 
the challenges in our previous architecture. We de- 
scribe Haystack’s design and implementation in Sec- 
tion 3. Section 4 characterizes our photo read and write 
workload and demonstrates that Haystack meets our de- 
sign goals. We draw comparisons to related work in Sec- 
tion 5 and conclude this paper in Section 6. 


2 Background & Previous Design 


In this section, we describe the architecture that ex- 
isted before Haystack and highlight the major lessons 
we learned. Because of space constraints our discus- 
sion of this previous design elides several details of a 
production-level deployment. 


2.1 Background 


We begin with a brief overview of the typical design
for how web servers, content delivery networks (CDNs),
and storage systems interact to serve photos on a popular
site. Figure 1 depicts the steps from the moment when
a user visits a page containing an image until she downloads
that image from its location on disk. When visiting
a page the user’s browser first sends an HTTP request
to a web server which is responsible for generating the
markup for the browser to render. For each image the
web server constructs a URL directing the browser to a
location from which to download the data. For popular
sites this URL often points to a CDN. If the CDN has
the image cached then the CDN responds immediately
with the data. Otherwise, the CDN examines the URL,
which has enough information embedded to retrieve the
photo from the site’s storage systems. The CDN then
updates its cached data and sends the image to the user’s
browser.


¹The term ‘usable’ takes into account capacity consumed by factors
such as RAID level, replication, and the underlying filesystem.


2.2 NFS-based Design 


In our first design we implemented the photo storage 
system using an NFS-based approach. While the rest 
of this subsection provides more detail on that design, 
the major lesson we learned is that CDNs by themselves 
do not offer a practical solution to serving photos on a 
social networking site. CDNs do effectively serve the
hottest photos (profile pictures and photos that have
been recently uploaded), but a social networking site
like Facebook also generates a large number of requests 
for less popular (often older) content, which we refer to 
as the long tail. Requests from the long tail account for a 
significant amount of our traffic, almost all of which ac- 
cesses the backing photo storage hosts as these requests 
typically miss in the CDN. While it would be very con- 
venient to cache all of the photos for this long tail, doing 
so would not be cost effective because of the very large 
cache sizes required. 


Our NFS-based design stores each photo in its own 
file on a set of commercial NAS appliances. A set of 
machines, Photo Store servers, then mount all the volumes
exported by these NAS appliances over NFS. Figure 2
illustrates this architecture and shows Photo Store
servers processing HTTP requests for images. From an 
image’s URL a Photo Store server extracts the volume 
and full path to the file, reads the data over NFS, and 
returns the result to the CDN. 

We initially stored thousands of files in each directory 
of an NFS volume which led to an excessive number of 
disk operations to read even a single image. Because 
of how the NAS appliances manage directory metadata, 
placing thousands of files in a directory was extremely 
inefficient as the directory’s blockmap was too large to 
be cached effectively by the appliance. Consequently 
it was common to incur more than 10 disk operations to 
retrieve a single image. After reducing directory sizes to 
hundreds of images per directory, the resulting system 
would still generally incur 3 disk operations to fetch an 
image: one to read the directory metadata into memory, 
a second to load the inode into memory, and a third to 
read the file contents. 

To further reduce disk operations we let the Photo 
Store servers explicitly cache file handles returned by 
the NAS appliances. When reading a file for the first 
time a Photo Store server opens a file normally but also 
caches the filename to file handle mapping in mem- 
cache [18]. When requesting a file whose file handle 
is cached, a Photo Store server opens the file directly 
using a custom system call, open_by_filehandle, that 
we added to the kernel. Regrettably, this file handle 
cache provides only a minor improvement as less pop- 
ular photos are less likely to be cached to begin with. 


One could argue that an approach in which all file han- 
dles are stored in memcache might be a workable solu- 
tion. However, that only addresses part of the problem 
as it relies on the NAS appliance having all of its in- 
odes in main memory, an expensive requirement for tra- 
ditional filesystems. The major lesson we learned from 
the NAS approach is that focusing only on caching— 
whether the NAS appliance’s cache or an external cache 
like memcache—has limited impact for reducing disk 
operations. The storage system ends up processing the 
long tail of requests for less popular photos, which are 
not available in the CDN and are thus likely to miss in 
our caches. 


2.3 Discussion 


It would be difficult for us to offer precise guidelines 
for when or when not to build a custom storage system. 
However, we believe it is still helpful for the community
to gain insight into why we decided to build Haystack. 

Faced with the bottlenecks in our NFS-based design, 
we explored whether it would be useful to build a sys- 
tem similar to GFS [9]. Since we store most of our user 
data in MySQL databases, the main use cases for files 
in our system were the directories engineers use for de- 
velopment work, log data, and photos. NAS appliances 
offer a very good price/performance point for develop- 
ment work and for log data. Furthermore, we leverage 
Hadoop [11] for the extremely large log data. Serving 
photo requests in the long tail represents a problem for 
which neither MySQL, NAS appliances, nor Hadoop are 
well-suited. 

One could phrase the dilemma we faced as exist- 
ing storage systems lacked the right RAM-to-disk ra- 
tio. However, there is no right ratio. The system just 
needs enough main memory so that all of the filesystem 
metadata can be cached at once. In our NAS-based ap- 
proach, one photo corresponds to one file and each file 
requires at least one inode, which is hundreds of bytes 
large. Having enough main memory in this approach is 
not cost-effective. To achieve a better price/performance 
point, we decided to build a custom storage system that 
reduces the amount of filesystem metadata per photo so 
that having enough main memory is dramatically more 
cost-effective than buying more NAS appliances. 
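
To make the memory argument concrete, the following back-of-the-envelope Python sketch estimates the main memory needed to cache one inode per photo file. The image count comes from the abstract; the inode size of 256 bytes is an assumption, since the text only says "hundreds of bytes", and decimal terabytes are assumed.

```python
def metadata_ram_bytes(num_files, bytes_per_file):
    """Main memory needed to cache one metadata entry per file."""
    return num_files * bytes_per_file


NUM_IMAGES = 260 * 10**9   # images stored, from the abstract
INODE_BYTES = 256          # assumption: the text only says "hundreds of bytes"
TB = 10**12                # decimal terabyte

inode_ram_tb = metadata_ram_bytes(NUM_IMAGES, INODE_BYTES) / TB
print(f"inode cache for one-file-per-photo: ~{inode_ram_tb:.1f} TB of RAM")
```

Under these assumptions the inode cache alone would need tens of terabytes of RAM across the storage tier, which is why shrinking the per-photo metadata matters more than tuning a RAM-to-disk ratio.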


3 Design & Implementation 


Facebook uses a CDN to serve popular images and 
leverages Haystack to respond to photo requests in the 
long tail efficiently. When a web site has an I/O bot- 
tleneck serving static content the traditional solution is 
to use a CDN. The CDN shoulders enough of the bur- 
den so that the storage system can process the remaining 
tail. At Facebook a CDN would have to cache an unreasonably
large amount of the static content in order for
traditional (and inexpensive) storage approaches not to 
be I/O bound. 

Understanding that in the near future CDNs would not 
fully solve our problems, we designed Haystack to ad- 
dress the critical bottleneck in our NFS-based approach: 
disk operations. We accept that requests for less popu- 
lar photos may require disk operations, but aim to limit 
the number of such operations to only the ones neces- 
sary for reading actual photo data. Haystack achieves 
this goal by dramatically reducing the memory used for 
filesystem metadata, thereby making it practical to keep 
all this metadata in main memory. 

Recall that storing a single photo per file resulted 
in more filesystem metadata than could be reasonably 
cached. Haystack takes a straight-forward approach: 
it stores multiple photos in a single file and therefore 
maintains very large files. We show that this straight- 
forward approach is remarkably effective. Moreover, we 
argue that its simplicity is its strength, facilitating rapid 
implementation and deployment. We now discuss how 
this core technique and the architectural components 
surrounding it provide a reliable and available storage 
system. In the following description of Haystack, we 
distinguish between two kinds of metadata. Applica- 
tion metadata describes the information needed to con- 
struct a URL that a browser can use to retrieve a photo. 
Filesystem metadata identifies the data necessary for a 
host to retrieve the photos that reside on that host’s disk. 


3.1 Overview


The Haystack architecture consists of 3 core compo- 
nents: the Haystack Store, Haystack Directory, and 
Haystack Cache. For brevity we refer to these com- 
ponents with ‘Haystack’ elided. The Store encapsu- 
lates the persistent storage system for photos and is the 
only component that manages the filesystem metadata 
for photos. We organize the Store’s capacity by phys- 
ical volumes. For example, we can organize a server’s 
10 terabytes of capacity into 100 physical volumes each 
of which provides 100 gigabytes of storage. We further 
group physical volumes on different machines into logi- 
cal volumes. When Haystack stores a photo on a logical 
volume, the photo is written to all corresponding physi- 
cal volumes. This redundancy allows us to mitigate data 
loss due to hard drive failures, disk controller bugs, etc. 
The Directory maintains the logical to physical mapping 
along with other application metadata, such as the log- 
ical volume where each photo resides and the logical 
volumes with free space. The Cache functions as our in- 
ternal CDN, which shelters the Store from requests for 
the most popular photos and provides insulation if up- 
stream CDN nodes fail and need to refetch content. 
Figure 3: Serving a photo


Figure 3 illustrates how the Store, Directory, and 
Cache components fit into the canonical interactions be- 
tween a user’s browser, web server, CDN, and storage 
system. In the Haystack architecture the browser can be 
directed to either the CDN or the Cache. Note that while 
the Cache is essentially a CDN, to avoid confusion we 
use ‘CDN’ to refer to external systems and ‘Cache’ to
refer to our internal one that caches photos. Having an 
internal caching infrastructure gives us the ability to re- 
duce our dependence on external CDNs. 

When a user visits a page the web server uses the Di- 
rectory to construct a URL for each photo. The URL 
contains several pieces of information, each piece cor- 
responding to the sequence of steps from when a user’s 
browser contacts the CDN (or Cache) to ultimately re- 
trieving a photo from a machine in the Store. A typical 
URL that directs the browser to the CDN looks like the 
following: 


http://(CDN)/(Cache)/(Machine id)/(Logical volume, Photo) 


The first part of the URL specifies from which CDN 
to request the photo. The CDN can look up the photo
internally using only the last part of the URL: the logical 
volume and the photo id. If the CDN cannot locate the 
photo then it strips the CDN address from the URL and 
contacts the Cache. The Cache does a similar lookup to 
find the photo and, on a miss, strips the Cache address 
from the URL and requests the photo from the specified 
Store machine. Photo requests that go directly to the 
Cache have a similar workflow except that the URL is 
missing the CDN specific information. 
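
The address stripping described above can be sketched in Python as follows. This is only an illustration: the host names, the ids, and the encoding of the last component as `volume,photo` are assumptions made for the example, and `strip_first_hop` and `lookup_key` are hypothetical helper names.

```python
def strip_first_hop(url):
    """Drop the first address after the scheme, promoting the next one to host.

    'http://cdn/cache/machine/vol,photo' -> 'http://cache/machine/vol,photo'
    """
    scheme, rest = url.split("://", 1)
    _first_hop, remainder = rest.split("/", 1)
    return f"{scheme}://{remainder}"


def lookup_key(url):
    """Return the last URL component: the logical volume and photo id."""
    return url.rsplit("/", 1)[1]


# All host names and ids below are made up for illustration.
cdn_url = "http://cdn.example.com/cache7.example.net/store42/1001,555"
cache_url = strip_first_hop(cdn_url)     # what the CDN requests on a miss
store_url = strip_first_hop(cache_url)   # what the Cache requests on a miss
```

Each hop needs only the last component for its own lookup, so the key stays the same while the routing prefix shrinks.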
Figure 4: Uploading a photo


Figure 4 illustrates the upload path in Haystack. 
When a user uploads a photo she first sends the data to a 
web server. Next, that server requests a write-enabled 
logical volume from the Directory. Finally, the web 
server assigns a unique id to the photo and uploads it 
to each of the physical volumes mapped to the assigned 
logical volume. 


3.2 Haystack Directory 


The Directory serves four main functions. First, it pro- 
vides a mapping from logical volumes to physical vol- 
umes. Web servers use this mapping when uploading 
photos and also when constructing the image URLs for 
a page request. Second, the Directory load balances 
writes across logical volumes and reads across physi- 
cal volumes. Third, the Directory determines whether 
a photo request should be handled by the CDN or by 
the Cache. This functionality lets us adjust our depen- 
dence on CDNs. Fourth, the Directory identifies those 
logical volumes that are read-only either because of op- 
erational reasons or because those volumes have reached 
their storage capacity. We mark volumes as read-only at 
the granularity of machines for operational ease. 

When we increase the capacity of the Store by adding 
new machines, those machines are write-enabled; only 
write-enabled machines receive uploads. Over time the 
available capacity on these machines decreases. When a 
machine exhausts its capacity, we mark it as read-only. 
In the next subsection we discuss how this distinction 
has subtle consequences for the Cache and Store. 

The Directory is a relatively straight-forward compo- 
nent that stores its information in a replicated database 
accessed via a PHP interface that leverages memcache
to reduce latency. In the event that we lose the data on
a Store machine we remove the corresponding entry in 
the mapping and replace it when a new Store machine is 
brought online. 
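
A toy model of the Directory's mapping and load-balancing roles might look like the Python sketch below. It is an assumption-laden illustration rather than the production design: the class and method names are hypothetical, three replicas per logical volume is an arbitrary choice, and uniform random selection stands in for whatever balancing policy is actually used.

```python
import random


class Directory:
    """Toy model of the Directory's mapping and write load balancing."""

    def __init__(self):
        self.logical_to_physical = {}   # logical volume id -> list of machine names
        self.read_only = set()          # logical volume ids closed to writes

    def add_logical_volume(self, logical_id, machines):
        self.logical_to_physical[logical_id] = list(machines)

    def mark_machine_read_only(self, machine):
        """Mark every logical volume with a replica on `machine` read-only."""
        for logical_id, machines in self.logical_to_physical.items():
            if machine in machines:
                self.read_only.add(logical_id)

    def pick_write_volume(self, rng=random):
        """Choose a write-enabled logical volume (uniformly, as an assumption)."""
        writable = [v for v in self.logical_to_physical if v not in self.read_only]
        if not writable:
            raise RuntimeError("no write-enabled logical volumes")
        return rng.choice(sorted(writable))

    def pick_read_machine(self, logical_id, rng=random):
        """Choose one physical replica to serve a read."""
        return rng.choice(self.logical_to_physical[logical_id])


d = Directory()
d.add_logical_volume(1, ["store-a", "store-b", "store-c"])
d.add_logical_volume(2, ["store-d", "store-e", "store-f"])
d.mark_machine_read_only("store-b")
```

Marking at machine granularity means one full machine closes every logical volume it participates in, which matches the operational simplification described above.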


3.3 Haystack Cache


The Cache receives HTTP requests for photos from 
CDNs and also directly from users’ browsers. We or- 
ganize the Cache as a distributed hash table and use a 
photo’s id as the key to locate cached data. If the Cache 
cannot immediately respond to the request, then the 
Cache fetches the photo from the Store machine iden- 
tified in the URL and replies to either the CDN or the 
user’s browser as appropriate. 

We now highlight an important behavioral aspect of 
the Cache. It caches a photo only if two conditions 
are met: (a) the request comes directly from a user and 
not the CDN and (b) the photo is fetched from a write- 
enabled Store machine. The justification for the first 
condition is that our experience with the NFS-based de- 
sign showed post-CDN caching is ineffective as it is un- 
likely that a request that misses in the CDN would hit in 
our internal cache. The reasoning for the second is in- 
direct. We use the Cache to shelter write-enabled Store 
machines from reads because of two interesting proper- 
ties: photos are most heavily accessed soon after they 
are uploaded and filesystems for our workload gener- 
ally perform better when doing either reads or writes 
but not both (Section 4.1). Thus the write-enabled Store 
machines would see the most reads if it were not for 
the Cache. Given this characteristic, an optimization we 
plan to implement is to proactively push recently up- 
loaded photos into the Cache as we expect those photos 
to be read soon and often. 
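
The two caching conditions reduce to a single predicate. The Python sketch below states it explicitly; the function and parameter names are hypothetical.

```python
def should_cache(request_from_cdn, store_write_enabled):
    """Cache only photos requested directly by users from write-enabled machines."""
    return (not request_from_cdn) and store_write_enabled


# Enumerate all four combinations of the two conditions.
decisions = {
    (from_cdn, write_enabled): should_cache(from_cdn, write_enabled)
    for from_cdn in (False, True)
    for write_enabled in (False, True)
}
```

Only one of the four combinations results in caching: a direct user request served by a write-enabled Store machine.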


3.4 Haystack Store 


The interface to Store machines is intentionally basic. 
Reads make very specific and well-contained requests 
asking for a photo with a given id, for a certain logical 
volume, and from a particular physical Store machine. 
The machine returns the photo if it is found. Otherwise, 
the machine returns an error. 

Each Store machine manages multiple physical vol- 
umes. Each volume holds millions of photos. For 
concreteness, the reader can think of a physical vol- 
ume as simply a very large file (100 GB) saved as 
‘/hay/haystack_<logical volume id>’. A Store machine 
can access a photo quickly using only the id of the cor- 
responding logical volume and the file offset at which 
the photo resides. This knowledge is the keystone of 
the Haystack design: retrieving the filename, offset, and 
size for a particular photo without needing disk operations.
A Store machine keeps open file descriptors for
each physical volume that it manages and also an in-memory
mapping of photo ids to the filesystem metadata
(i.e., file, offset and size in bytes) critical for retrieving
that photo.


Figure 5: Layout of Haystack Store file


Field          | Explanation
Header         | Magic number used for recovery
Cookie         | Random number to mitigate brute force lookups
Key            | 64-bit photo id
Alternate key  | 32-bit supplemental id
Flags          | Signifies deleted status
Size           | Data size
Data           | The actual photo data
Footer         | Magic number for recovery
Data Checksum  | Used to check integrity
Padding        | Total needle size is aligned to 8 bytes


Table 1: Explanation of fields in a needle

We now describe the layout of each physical volume 
and how to derive the in-memory mapping from that 
volume. A Store machine represents a physical volume 
as a large file consisting of a superblock followed by 
a sequence of needles. Each needle represents a photo 
stored in Haystack. Figure 5 illustrates a volume file and 
the format of each needle. Table 1 describes the fields
in each needle. 
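
As a minimal sketch of the needle fields listed in Table 1, the Python code below serializes and parses one needle. Only the 64-bit key, the 32-bit alternate key, and the 8-byte alignment come from the table; the magic numbers, the widths of the other fields, the byte order, and the use of CRC-32 as the checksum are assumptions made for illustration.

```python
import struct
import zlib

HEADER_MAGIC = 0xFACEB00C   # hypothetical magic numbers
FOOTER_MAGIC = 0xF007E2ED
# Big-endian: header magic (4), cookie (8), key (8), alternate key (4),
# flags (1), data size (4). Only the key and alternate key widths come
# from Table 1; the rest are assumptions.
PREFIX = struct.Struct(">IQQIBI")
SUFFIX = struct.Struct(">II")   # footer magic (4), data checksum (4)


def encode_needle(cookie, key, alt_key, data, flags=0):
    body = PREFIX.pack(HEADER_MAGIC, cookie, key, alt_key, flags, len(data))
    body += data + SUFFIX.pack(FOOTER_MAGIC, zlib.crc32(data))
    padding = -len(body) % 8          # align total needle size to 8 bytes
    return body + b"\x00" * padding


def decode_needle(buf):
    magic, cookie, key, alt_key, flags, size = PREFIX.unpack_from(buf, 0)
    data = buf[PREFIX.size:PREFIX.size + size]
    footer, checksum = SUFFIX.unpack_from(buf, PREFIX.size + size)
    if magic != HEADER_MAGIC or footer != FOOTER_MAGIC:
        raise ValueError("bad magic number")
    if zlib.crc32(data) != checksum:
        raise ValueError("checksum mismatch")
    return cookie, key, alt_key, flags, data


needle = encode_needle(cookie=0x1234, key=42, alt_key=ord("n"), data=b"jpeg bytes")
```

Because every needle is self-describing, a scan of the volume file can recover the key, flags, size, and offset of each photo without any external metadata.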

To retrieve needles quickly, each Store machine maintains
an in-memory data structure for each of its volumes.
That data structure maps pairs of (key, alternate key)²
to the corresponding needle’s flags, size in
bytes, and volume offset. After a crash, a Store machine
can reconstruct this mapping directly from the volume
file before processing requests. We now describe how
a Store machine maintains its volumes and in-memory
mapping while responding to read, write, and delete requests
(the only operations supported by the Store).


²For historical reasons, a photo’s id corresponds to the key while its
type is used for the alternate key. During an upload, web servers scale
each photo to four different sizes (or types) and store them as separate
needles, but with the same key. The important distinction among these


3.4.1 Photo Read


When a Cache machine requests a photo it supplies the 
logical volume id, key, alternate key, and cookie to the 
Store machine. The cookie is a number embedded in 
the URL for a photo. The cookie’s value is randomly 
assigned by and stored in the Directory at the time that 
the photo is uploaded. The cookie effectively eliminates 
attacks aimed at guessing valid URLs for photos. 

When a Store machine receives a photo request from a 
Cache machine, the Store machine looks up the relevant 
metadata in its in-memory mappings. If the photo has 
not been deleted the Store machine seeks to the appro- 
priate offset in the volume file, reads the entire needle 
from disk (whose size it can calculate ahead of time), 
and verifies the cookie and the integrity of the data. If 
these checks pass then the Store machine returns the 
photo to the Cache machine. 
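
The read path above can be sketched in Python as follows. To keep the example short it uses a simplified needle (cookie, checksum, then data) rather than the full layout of Table 1, an in-memory `io.BytesIO` object in place of the volume file, and `None` in place of an error reply; all names and these simplifications are assumptions for illustration.

```python
import io
import struct
import zlib

DELETED = 1
RECORD_HEAD = struct.Struct(">QI")   # simplified needle: cookie, checksum, then data


def read_photo(volume, mapping, key, alt_key, cookie):
    """Return photo bytes, or None if missing, deleted, or failing checks."""
    entry = mapping.get((key, alt_key))
    if entry is None:
        return None
    flags, size, offset = entry
    if flags & DELETED:
        return None
    volume.seek(offset)
    needle = volume.read(RECORD_HEAD.size + size)   # one read for the whole needle
    stored_cookie, checksum = RECORD_HEAD.unpack_from(needle)
    data = needle[RECORD_HEAD.size:]
    if stored_cookie != cookie or zlib.crc32(data) != checksum:
        return None
    return data


photo = b"photo-bytes"
volume = io.BytesIO(b"\x00" * 16 + RECORD_HEAD.pack(99, zlib.crc32(photo)) + photo)
mapping = {(7, 1): (0, len(photo), 16)}
```

The mapping lookup and the deleted-flag check happen entirely in memory, so the only disk access is the single read of the needle itself.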


3.4.2 Photo Write


When uploading a photo into Haystack web servers pro- 
vide the logical volume id, key, alternate key, cookie, 
and data to Store machines. Each machine syn- 
chronously appends needle images to its physical vol- 
ume files and updates in-memory mappings as needed. 
While simple, this append-only restriction complicates 
some operations that modify photos, such as rotations. 
As Haystack disallows overwriting needles, photos can 
only be modified by adding an updated needle with the 
same key and alternate key. If the new needle is written 
to a different logical volume than the original, the Direc- 
tory updates its application metadata and future requests 
will never fetch the older version. If the new needle is 
written to the same logical volume, then Store machines 
append the new needle to the same corresponding physi- 
cal volumes. Haystack distinguishes such duplicate nee- 
dles based on their offsets. That is, the latest version of a 
needle within a physical volume is the one at the highest 
offset. 
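
The following Python sketch illustrates why the highest offset identifies the latest version under append-only writes. For brevity the needle is reduced to its raw data (no header, cookie, or checksum) and an `io.BytesIO` object stands in for the volume file; both are simplifications, and the function names are hypothetical.

```python
import io


def append_needle(volume, mapping, key, alt_key, data):
    """Append data to the volume and point the mapping at the new copy."""
    offset = volume.seek(0, io.SEEK_END)
    volume.write(data)
    # Appends only grow the file, so the newest needle for a key always
    # has the highest offset and simply replaces the older mapping entry.
    mapping[(key, alt_key)] = (offset, len(data))
    return offset


def read_latest(volume, mapping, key, alt_key):
    offset, size = mapping[(key, alt_key)]
    volume.seek(offset)
    return volume.read(size)


volume, mapping = io.BytesIO(), {}
first = append_needle(volume, mapping, 7, 1, b"original")
second = append_needle(volume, mapping, 7, 1, b"rotated!")
```

The superseded needle remains in the file and keeps consuming space until the volume is compacted.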


3.4.3 Photo Delete 


Deleting a photo is straight-forward. A Store machine 
sets the delete flag in both the in-memory mapping 
and synchronously in the volume file. Requests to get 
deleted photos first check the in-memory flag and return 
errors if that flag is enabled. Note that the space occupied
by deleted needles is for the moment lost. Later,
we discuss how to reclaim deleted needle space by compacting
volume files.


needles is the alternate key field, which in decreasing order can be ‘n,’
‘a,’ ‘s,’ or ‘t’.


Figure 6: Layout of Haystack Index file


3.4.4 The Index File 


Store machines use an important optimization—the in- 
dex file—when rebooting. While in theory a machine 
can reconstruct its in-memory mappings by reading all 
of its physical volumes, doing so is time-consuming as 
the amount of data (terabytes worth) has to all be read 
from disk. Index files allow a Store machine to build its 
in-memory mappings quickly, shortening restart time. 


Store machines maintain an index file for each of 
their volumes. The index file is a checkpoint of the in- 
memory data structures used to locate needles efficiently 
on disk. An index file’s layout is similar to a volume 
file’s, containing a superblock followed by a sequence 
of index records corresponding to each needle in the
volume file. These records must appear in the same order
as the corresponding needles appear in the volume file. 
Figure 6 illustrates the layout of the index file and Ta- 
ble 2 explains the different fields in each record. 


Restarting using the index is slightly more compli- 
cated than just reading the indices and initializing the 
in-memory mappings. The complications arise because 
index files are updated asynchronously, meaning that 
index files may represent stale checkpoints. When we 
write a new photo the Store machine synchronously ap- 
pends a needle to the end of the volume file and asyn- 
chronously appends a record to the index file. When 
we delete a photo, the Store machine synchronously sets 
the flag in that photo’s needle without updating the in- 
dex file. These design decisions allow write and delete 
operations to return faster because they avoid additional 
synchronous disk writes. They also cause two side ef- 
fects we must address: needles can exist without corre- 
sponding index records and index records do not reflect 
deleted photos. 


Field          | Explanation
Key            | 64-bit key
Alternate key  | 32-bit alternate key
Flags          | Currently unused
Offset         | Needle offset in the Haystack Store
Size           | Needle data size


Table 2: Explanation of fields in index file. 


We refer to needles without corresponding index 
records as orphans. During restarts, a Store machine 
sequentially examines each orphan, creates a match- 
ing index record, and appends that record to the index 
file. Note that we can quickly identify orphans because 
the last record in the index file corresponds to the last 
non-orphan needle in the volume file. To complete the 
restart, the Store machine now initializes its in-memory 
mappings using only the index files. 
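
A minimal Python sketch of this restart procedure is shown below. It models the index file and the volume file as lists of `(key, alt_key, offset, size)` tuples, which is an assumption for illustration; a real implementation would presumably scan the volume file starting just past the last indexed needle rather than filter a full list.

```python
def recover_index(index_records, volume_needles):
    """Append index records for orphans and rebuild the in-memory mapping.

    Both arguments are lists of (key, alt_key, offset, size) tuples in
    volume-file order; in a real system they would be parsed from disk.
    """
    last_indexed = index_records[-1][2] if index_records else -1
    orphans = [n for n in volume_needles if n[2] > last_indexed]
    index_records.extend(orphans)       # stands in for appending to the index file
    mapping = {}
    for key, alt_key, offset, size in index_records:
        mapping[(key, alt_key)] = (offset, size)   # later records win
    return orphans, mapping


needles = [(1, 0, 8, 100), (2, 0, 112, 50), (3, 0, 168, 70)]
index = needles[:2]                     # the record for needle 3 was never flushed
orphans, mapping = recover_index(index, needles)
```

Because index records appear in the same order as needles, a single offset comparison is enough to separate indexed needles from orphans.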


Since index records do not reflect deleted photos, a 
Store machine may retrieve a photo that has in fact been 
deleted. To address this issue, after a Store machine 
reads the entire needle for a photo, that machine can 
then inspect the deleted flag. If a needle is marked as 
deleted the Store machine updates its in-memory map- 
ping accordingly and notifies the Cache that the object 
was not found. 


3.4.5 Filesystem 


We describe Haystack as an object store that utilizes 
a generic Unix-like filesystem, but some filesystems 
are better suited for Haystack than others. In partic- 
ular, the Store machines should use a filesystem that 
does not need much memory to be able to perform ran- 
dom seeks within a large file quickly. Currently, each 
Store machine uses XFS [24], an extent based file sys- 
tem. XFS has two main advantages for Haystack. First, 
the blockmaps for several contiguous large files can 
be small enough to be stored in main memory. Sec- 
ond, XFS provides efficient file preallocation, mitigat- 
ing fragmentation and reining in how large block maps 
can grow. 


Using XFS, Haystack can eliminate disk operations 
for retrieving filesystem metadata when reading a photo. 
This benefit, however, does not imply that Haystack can 
guarantee every photo read will incur exactly one disk 
operation. There exist corner cases where the filesystem
requires more than one disk operation when photo
data crosses extents or RAID boundaries. Haystack pre- 
allocates 1 gigabyte extents and uses 256 kilobyte RAID 
stripe sizes so that in practice we encounter these cases 
rarely. 
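
The corner cases can be expressed as a simple boundary check, sketched in Python below. The sketch assumes binary units and that extents and stripes start at multiples of their size relative to the file offset, which need not hold on a real disk layout; the function names are hypothetical.

```python
EXTENT_BYTES = 1 << 30        # 1 gigabyte preallocated extents
STRIPE_BYTES = 256 * 1024     # 256 kilobyte RAID stripe size


def crosses_boundary(offset, size, unit):
    """True if bytes [offset, offset + size) span more than one `unit`-sized block."""
    return offset // unit != (offset + size - 1) // unit


def may_need_extra_io(offset, size):
    """True if a read of `size` bytes at `offset` crosses a stripe or extent edge."""
    return (crosses_boundary(offset, size, STRIPE_BYTES)
            or crosses_boundary(offset, size, EXTENT_BYTES))
```

For example, under these assumptions a 64 KB read starting at offset 0 stays within one stripe, while the same read starting at 250 KB spills into the next one.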
3.5 Recovery from failures


Like many other large-scale systems running on com- 
modity hardware [5, 4, 9], Haystack needs to tolerate 
a variety of failures: faulty hard drives, misbehaving 
RAID controllers, bad motherboards, etc. We use two 
straight-forward techniques to tolerate failures—one for 
detection and another for repair. 

To proactively find Store machines that are having 
problems, we maintain a background task, dubbed pitch- 
fork, that periodically checks the health of each Store 
machine. Pitchfork remotely tests the connection to 
each Store machine, checks the availability of each vol- 
ume file, and attempts to read data from the Store ma- 
chine. If pitchfork determines that a Store machine con- 
sistently fails these health checks then pitchfork auto- 
matically marks all logical volumes that reside on that 
Store machine as read-only. We manually address the 
underlying cause for the failed checks offline. 
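
A detection loop of this kind might look like the sketch below. It is only an illustration under stated assumptions: the text does not say how "consistently fails" is decided, so the consecutive-failure `threshold` is hypothetical, as are the function name `pitchfork_pass` and the representation of health checks as callables.

```python
from typing import Callable, Dict, Iterable, List, Set


def pitchfork_pass(
    machines: Dict[str, List[str]],
    checks: Iterable[Callable[[str], bool]],
    failures: Dict[str, int],
    threshold: int = 3,
) -> Set[str]:
    """Run one sweep of health checks.

    machines maps a Store machine name to its logical volumes.
    failures carries consecutive-failure counts between sweeps.
    Returns the logical volumes that should be marked read-only.
    """
    checks = list(checks)
    read_only: Set[str] = set()
    for machine, volumes in machines.items():
        healthy = all(check(machine) for check in checks)
        # Reset on success so only consecutive failures count.
        failures[machine] = 0 if healthy else failures.get(machine, 0) + 1
        if failures[machine] >= threshold:
            read_only.update(volumes)
    return read_only
```

In a real deployment the checks would test the connection, the availability of each volume file, and a sample read, as the text describes; here they are left abstract.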

Once diagnosed, we may be able to fix the prob- 
lem quickly. Occasionally, the situation requires a more 
heavy-handed bulk sync operation in which we reset the 
data of a Store machine using the volume files supplied 
by a replica. Bulk syncs happen rarely (a few each 
month) and are simple albeit slow to carry out. The main 
bottleneck is that the amount of data to be bulk synced is
often orders of magnitude greater than what the NIC on
each Store machine can transfer per second, resulting in a
mean time to recovery of hours. We are actively exploring techniques
to address this constraint. 


3.6 Optimizations 


We now discuss several optimizations important to 
Haystack’s success. 

3.6.1 Compaction 

Compaction is an online operation that reclaims the 
space used by deleted and duplicate needles (needles 
with the same key and alternate key). A Store machine 
compacts a volume file by copying needles into a new 
file while skipping any duplicate or deleted entries. Dur- 
ing compaction, deletes go to both files. Once this pro- 
cedure reaches the end of the file, it blocks any further 
modifications to the volume and atomically swaps the 
files and in-memory structures. 
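
The copy step can be sketched as below. This is a simplified, offline illustration: it does not model concurrent deletes going to both files or the atomic swap. The `Needle` tuple is hypothetical, and the rule that the most recently appended duplicate is the live one is an assumption of the sketch.

```python
from typing import Dict, List, NamedTuple, Tuple


class Needle(NamedTuple):
    key: int
    alternate_key: int
    deleted: bool
    data: bytes


def compact(volume: List[Needle]) -> Tuple[List[Needle], Dict[Tuple[int, int], int]]:
    """Copy live needles into a new volume, skipping deleted and duplicate ones.

    volume is in append order. Returns the new volume and a fresh
    (key, alternate key) -> position mapping for it.
    """
    # First pass: remember the last position of every (key, alternate key).
    latest: Dict[Tuple[int, int], int] = {}
    for position, needle in enumerate(volume):
        latest[(needle.key, needle.alternate_key)] = position

    # Second pass: keep a needle only if it is the latest copy and not deleted.
    new_volume: List[Needle] = []
    new_mapping: Dict[Tuple[int, int], int] = {}
    for position, needle in enumerate(volume):
        ident = (needle.key, needle.alternate_key)
        if needle.deleted or latest[ident] != position:
            continue
        new_mapping[ident] = len(new_volume)
        new_volume.append(needle)
    return new_volume, new_mapping
```

Because a superseded needle is skipped even when the latest copy is itself deleted, a photo that was overwritten and then deleted leaves nothing behind in the new file.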

We use compaction to free up space from deleted pho- 
tos. The pattern for deletes is similar to photo views: 
young photos are a lot more likely to be deleted. Over 
the course of a year, about 25% of the photos get deleted. 

3.6.2 Saving more memory

As described, a Store machine maintains an in-memory 
data structure that includes flags, but our current system 
only uses the flags field to mark a needle as deleted. We 
eliminate the need for an in-memory representation of 
flags by setting the offset to be 0 for deleted photos. In
addition, Store machines do not keep track of cookie 
values in main memory and instead check the supplied 
cookie after reading a needle from disk. Store machines 
reduce their main memory footprints by 20% through 
these two techniques. 

Currently, Haystack uses on average 10 bytes of main 
memory per photo. Recall that we scale each uploaded 
image to four photos all with the same key (64 bits), dif- 
ferent alternate keys (32 bits), and consequently differ- 
ent data sizes (16 bits). In addition to these 32 bytes, 
Haystack consumes approximately 2 bytes per scaled photo
in overheads due to hash tables, bringing the total for 
four scaled photos of the same image to 40 bytes. For 
comparison, consider that an xfs_inode_t structure in 
Linux is 536 bytes. 
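
The arithmetic behind the 40-byte and 10-byte figures can be checked with a few lines of Python. The field widths come from the text. The one interpretive assumption is that the roughly 2 bytes of hash-table overhead apply to each of the four scaled photos, which is the reading that makes the stated total come out.

```python
# Field widths stated in the text.
KEY_BITS = 64
ALTERNATE_KEY_BITS = 32
DATA_SIZE_BITS = 16
SCALED_PHOTOS_PER_IMAGE = 4
# Assumption: the ~2 bytes of hash-table overhead apply per scaled photo.
HASH_OVERHEAD_BYTES = 2


def bytes_per_uploaded_image() -> int:
    """Main memory used for the four scaled photos of one uploaded image."""
    key_bytes = KEY_BITS // 8                                      # shared key: 8
    per_photo_bytes = (ALTERNATE_KEY_BITS + DATA_SIZE_BITS) // 8   # 4 + 2 = 6
    field_bytes = key_bytes + SCALED_PHOTOS_PER_IMAGE * per_photo_bytes  # 32
    overhead_bytes = SCALED_PHOTOS_PER_IMAGE * HASH_OVERHEAD_BYTES       # 8
    return field_bytes + overhead_bytes


total = bytes_per_uploaded_image()
print(total)                            # 40
print(total / SCALED_PHOTOS_PER_IMAGE)  # 10.0
```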


3.6.3 Batch upload 


Since disks are generally better at performing large se- 
quential writes instead of small random writes, we batch 
uploads together when possible. Fortunately, many 
users upload entire albums to Facebook instead of single 
pictures, providing an obvious opportunity to batch the 
photos in an album together. We quantify the improve- 
ment of aggregating writes together in Section 4. 


4 Evaluation 


We divide our evaluation into four parts. In the first we 
characterize the photo requests seen by Facebook. In 
the second and third we show the effectiveness of the 
Directory and Cache, respectively. In the last we ana- 
lyze how well the Store performs using both synthetic 
and production workloads. 


4.1 Characterizing photo requests 


Photos are one of the primary kinds of content that users 
share on Facebook. Users upload millions of photos ev- 
ery day and recently uploaded photos tend to be much 
more popular than older ones. Figure 7 illustrates how 
popular each photo is as a function of the photo’s age. 
To understand the shape of the graph, it is useful to dis- 
cuss what drives Facebook’s photo requests. 


4.1.1 Features that drive photo requests 


Two features are responsible for 98% of Facebook’s 
photo requests: News Feed and albums. The News Feed 
feature shows users recent content that their friends have 
shared. The album feature lets a user browse her friends’ 
pictures. She can view recently uploaded photos and 
also browse all of the individual albums. 

Figure 7 shows a sharp rise in requests for photos that 
are a few days old. News Feed drives much of the traffic 
for recent photos and falls sharply away around 2 days 
when many stories stop being shown in the default Feed
view. There are two key points to highlight from the figure.
First, the rapid decline in popularity suggests that
caching at both CDNs and in the Cache can be very effective
for hosting popular content. Second, the graph
has a long tail, implying that a significant number of requests
cannot be dealt with using cached data.

Figure 7: Cumulative distribution function of the number
of photos requested in a day categorized by age (time
since it was uploaded).

Table 3: Volume of daily photo traffic.


4.1.2 Traffic Volume 


Table 3 shows the volume of photo traffic on Facebook. 
The number of Haystack photos written is 12 times the 
number of photos uploaded since our application scales 
each image to 4 sizes and saves each size in 3 different 
locations. The table shows that Haystack responds to 
approximately 10% of all photo requests from CDNs. 
Observe that smaller images account for most of the 
photos viewed. This trait underscores our desire to min- 
imize metadata overhead as inefficiencies can quickly 
add up. Additionally, reading smaller images is typically
a more latency sensitive operation for Facebook as
they are displayed in the News Feed whereas larger images
are shown in albums and can be prefetched to hide
latency.

Figure 8: Volume of multi-write operations sent to 9
different write-enabled Haystack Store machines. The
graph has 9 different lines that closely overlap each
other.


4.2 Haystack Directory 


The Haystack Directory balances reads and writes 
across Haystack Store machines. Figure 8 shows that, as
expected, the Directory’s straightforward hashing policy
to distribute reads and writes is very effective. The
graph shows the number of multi-write operations seen
by 9 different Store machines which were deployed into
production at the same time. Each of these boxes stores a
different set of photos. Since the lines are nearly indis- 
tinguishable, we conclude that the Directory balances 
writes well. Comparing read traffic across Store ma- 
chines shows similarly well-balanced behavior. 


4.3 Haystack Cache


Figure 9 shows the hit rate for the Haystack Cache. Re- 
call that the Cache only stores a photo if it is saved on 
a write-enabled Store machine. These photos are rel- 
atively recent, which explains the high hit rates of ap- 
proximately 80%. Since the write-enabled Store ma- 
chines would also see the greatest number of reads, the 
Cache is effective in dramatically reducing the read re- 
quest rate for the machines that would be most affected. 


4.4 Haystack Store 


Recall that Haystack targets the long tail of photo re- 
quests and aims to maintain high-throughput and low- 
latency despite seemingly random reads. We present 
performance results of Store machines on both synthetic 
and production workloads. 

                                          Reads                            Writes
Benchmark  Config  Operations             Throughput  Latency (in ms)      Throughput  Latency (in ms)
                                          (images/s)  Avg.   Std. dev.     (images/s)  Avg.   Std. dev.

Random IO  -       Only Reads             902.3       33.2   26.8          -           -      -
Haystress  A       Only Reads             770.6       38.9   30.2          -           -      -
Haystress  B       Only Reads             877.8       34.2   28.1          -           -      -
Haystress  C       Only Multi-Writes      -           -      -             6099.4      4.9    16.0
Haystress  D       Only Multi-Writes      -           -      -             7899.7      15.2   15.3
Haystress  E       Only Multi-Writes      -           -      -             10843.8     43.9   16.3
Haystress  F       Reads & Multi-Writes   718.1       41.6   31.6          232.0       11.9   6.3
Haystress  G       Reads & Multi-Writes   692.8       42.8   33.7          440.0       11.9   6.9

Table 4: Throughput and latency of read and multi-write operations on synthetic workloads. Config B uses a mix of
8KB and 64KB images. Remaining configs use 64KB images.

Figure 9: Cache hit rate for images that might be potentially
stored in the Haystack Cache.


4.4.1 Experimental setup


We deploy Store machines on commodity storage
blades. The typical hardware configuration of a 2U storage
blade has 2 hyper-threaded quad-core Intel Xeon
CPUs, 48 GB memory, a hardware RAID controller with
256-512MB NVRAM, and 12 x 1TB SATA drives.

Each storage blade provides approximately 9TB of
capacity, configured as a RAID-6 partition managed by
the hardware RAID controller. RAID-6 provides adequate
redundancy and excellent read performance while
keeping storage costs down. The controller’s NVRAM
write-back cache mitigates RAID-6’s reduced write performance.
Since our experience suggests that caching
photos on Store machines is ineffective, we reserve the
NVRAM fully for writes. We also disable disk caches
in order to guarantee data consistency in the event of a
crash or power loss.


9th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’10)


4.4.2 Benchmark performance


We assess the performance of a Store machine using two 
benchmarks: Randomio [22] and Haystress. Randomio 
is an open-source multithreaded disk I/O program that 
we use to measure the raw capabilities of storage de- 
vices. It issues random 64KB reads that use direct I/O to 
make sector aligned requests and reports the maximum 
sustainable throughput. We use Randomio to establish a 
baseline for read throughput against which we can com- 
pare results from our other benchmark. 


Haystress is a custom built multi-threaded program 
that we use to evaluate Store machines for a variety of 
synthetic workloads. It communicates with a Store ma- 
chine via HTTP (as the Cache would) and assesses the 
maximum read and write throughput a Store machine 
can maintain. Haystress issues random reads over a 
large set of dummy images to reduce the effect of the 
machine’s buffer cache; that is, nearly all reads require 
a disk operation. In this paper, we use seven different 
Haystress workloads to evaluate Store machines. 


Table 4 characterizes the read and write throughputs 
and associated latencies that a Store machine can sus- 
tain under our benchmarks. Workload A performs ran- 
dom reads to 64KB images on a Store machine with 201 
volumes. The results show that Haystack delivers 85% 
of the raw throughput of the device while incurring only 
17% higher latency. 
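
The 85% and 17% figures follow directly from the Random IO and workload A rows of Table 4. The short Python check below reproduces them. The dictionary names are just labels for this example.

```python
# Read throughput (images/s) and average read latency (ms) from Table 4.
RANDOMIO_READS = {"throughput": 902.3, "latency_ms": 33.2}
WORKLOAD_A = {"throughput": 770.6, "latency_ms": 38.9}

# Fraction of the raw device throughput that Haystack delivers.
throughput_fraction = WORKLOAD_A["throughput"] / RANDOMIO_READS["throughput"]

# Relative increase in average latency over the raw device.
latency_increase = WORKLOAD_A["latency_ms"] / RANDOMIO_READS["latency_ms"] - 1

print(f"{throughput_fraction:.0%}")  # 85%
print(f"{latency_increase:.0%}")     # 17%
```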


We attribute a Store machine’s overhead to four fac- 
tors: (a) it runs on top of the filesystem instead of access- 
ing disk directly; (b) disk reads are larger than 64KB as 
entire needles need to be read; (c) stored images may 
not be aligned to the underlying RAID-6 device stripe 
size so a small percentage of images are read from more 
than one disk; and (d) the CPU overhead of the Haystack server
(index access, checksum calculations, etc.).

In workload B, we again examine a read-only work- 
load but alter 70% of the reads so that they request 
smaller size images (8KB instead of 64KB). In practice, 
we find that most requests are not for the largest size 
images (as would be shown in albums) but rather for the 
thumbnails and profile pictures. 

Workloads C, D, and E show a Store machine’s write 
throughput. Recall that Haystack can batch writes to- 
gether. Workloads C, D, and E group 1, 4, and 16 writes 
into a single multi-write, respectively. The table shows 
that amortizing the fixed cost of writes over 4 and 16 
images improves throughput by 30% and 78% respec- 
tively. As expected, this reduces per image latency, as 
well. 
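
These improvements can be recomputed from the write rows of Table 4, as in the sketch below. It assumes that the latency column reports the average latency of a whole multi-write operation, so that dividing by the batch size gives an approximate per-image latency. The table does not state this explicitly.

```python
# Workload -> (images per multi-write, throughput in images/s, avg latency in ms),
# taken from Table 4.
WRITE_WORKLOADS = {
    "C": (1, 6099.4, 4.9),
    "D": (4, 7899.7, 15.2),
    "E": (16, 10843.8, 43.9),
}


def throughput_gain(workload: str, baseline: str = "C") -> float:
    """Relative throughput improvement over the unbatched baseline."""
    return WRITE_WORKLOADS[workload][1] / WRITE_WORKLOADS[baseline][1] - 1


def per_image_latency_ms(workload: str) -> float:
    """Average latency divided by batch size (assumes per-operation latency)."""
    batch_size, _, latency_ms = WRITE_WORKLOADS[workload]
    return latency_ms / batch_size


print(f"{throughput_gain('D'):.0%}")  # 30%
print(f"{throughput_gain('E'):.0%}")  # 78%
for name in ("C", "D", "E"):
    print(name, round(per_image_latency_ms(name), 2))
```

Under that assumption the per-image latency falls from 4.9 ms with no batching to about 3.8 ms and 2.7 ms for batches of 4 and 16.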

Finally, we look at the performance in the presence 
of both read and write operations. Workload F uses a 
mix of 98% reads and 2% multi-writes while G uses 
a mix of 96% reads and 4% multi-writes where each 
multi-write writes 16 images. These ratios reflect what 
is often observed in production. The table shows that the 
Store delivers high read throughput even in the presence 
of writes. 


4.4.3 Production workload


This section examines the performance of the Store on
production machines. As noted in Section 3, there
are two classes of Stores: write-enabled and read-only.
Write-enabled hosts service read and write requests, while
read-only hosts service only read requests. Since these
two classes have fairly different traffic characteristics, 
we analyze a group of machines in each class. All ma- 
chines have the same hardware configuration. 

Viewed at a per-second granularity, there can be large 
spikes in the volume of photo read and write operations 
that a Store box sees. To ensure reasonable latency even 
in the presence of these spikes, we conservatively allo- 
cate a large number of write-enabled machines so that 
their average utilization is low. 

Figure 10 shows the frequency of the different types 
of operations on a read-only and a write-enabled Store 
machine. Note that we see peak photo uploads on Sun- 
day and Monday, with a smooth drop the rest of the 
week until we level out on Thursday to Saturday. Then 
a new Sunday arrives and we hit a new weekly peak. In 
general our footprint grows by 0.2% to 0.5% per day. 

As noted in Section 3, write operations to the Store 
are always multi-writes on production machines to 
amortize the fixed cost of write operations. Finding 
groups of images is fairly straightforward since 4 different
sizes of each photo are stored in Haystack. It is
also common for users to upload a batch of photos into
a photo album. As a combination of these two factors,
the average number of images written per multi-write
for this write-enabled machine is 9.27.

Figure 10: Rate of different operations on two Haystack
Store machines: one read-only and the other write-enabled.

Section 4.1.2 explained that both read and delete rates 
are high for recently uploaded photos and drop over 
time. This behavior can also be observed in Figure 10;
the write-enabled boxes see many more requests
(even though some of the read traffic is served by the 
Cache). 

Another trend worth noting: as more data gets written 
to write-enabled boxes the volume of photos increases, 
resulting in an increase in the read request rate. 

Figure 11 shows the latency of read and multi-write 
operations on the same two machines as Figure 10 over 
the same period. 

The latency of multi-write operations is fairly low
(between 1 and 2 milliseconds) and stable even as the
volume of traffic varies dramatically. Haystack machines
have an NVRAM-backed RAID controller which
buffers writes for us. As described in Section 3, the
NVRAM allows us to write needles asynchronously and
then issue a single fsync to flush the volume file once the
multi-write is complete. Multi-write latencies are very
flat and stable.

Figure 11: Average latency of Read and Multi-write operations
on the two Haystack Store machines in Figure 10
over the same 3 week period.

The latency of reads on a read-only box is also fairly 
stable even as the volume of traffic varies significantly 
(up to 3x over the 3 week period). For a write-enabled 
box the read performance is impacted by three primary 
factors. First, as the number of photos stored on the machine
increases, the read traffic to that machine also increases
(compare week-over-week traffic in Figure 10).
Second, photos on write-enabled machines are cached
in the Cache while they are not cached for a read-only
machine (see footnote 3). This suggests that the buffer cache would be
more effective for a read-only machine. Third, recently
written photos are usually read back immediately because
Facebook highlights recent content. Such reads on
write-enabled boxes will always hit in the buffer cache
and improve the hit rate of the buffer cache. The shape
of the line in the figure is the result of a combination of 
these three factors. 

The CPU utilization on the Store machines is low. 
CPU idle time varies between 92% and 96%.


5 Related Work 


To our knowledge, Haystack targets a new design point
focusing on the long tail of photo requests seen by a
large social networking website.

Footnote 3: Note that for traffic coming through a CDN, photos are cached in
the CDNs and not in the Cache in both instances.


Filesystems Haystack takes after log-structured filesys- 
tems [23] which Rosenblum and Ousterhout designed 
to optimize write throughput with the idea that most 
reads could be served out of cache. While measure- 
ments [3] and simulations [6] have shown that log- 
structured filesystems have not reached their full poten- 
tial in local filesystems, the core ideas are very relevant 
to Haystack. Photos are appended to physical volume 
files in the Haystack Store and the Haystack Cache shel- 
ters write-enabled machines from being overwhelmed 
by the request rate for recently uploaded data. The key 
differences are (a) that the Haystack Store machines 
write their data in such a way that they can efficiently 
serve reads once they become read-only and (b) the read 
request rate for older data decreases over time. 


Several works [8, 19, 28] have proposed how to 
manage small files and metadata more efficiently. The 
common thread across these contributions is how to 
group related files and metadata together intelligently. 
Haystack obviates these problems since it maintains 
metadata in main memory and users often upload 
related photos in bulk. 


Object-based storage Haystack’s architecture shares 
many similarities with object storage systems proposed 
by Gibson et al. [10] in Network-Attached Secure Disks 
(NASD). The Haystack Directory and Store are perhaps 
most similar to the File and Storage Manager concepts, 
respectively, in NASD that separate the logical storage 
units from the physical ones. In OBFS [25], Wang et 
al. build a user-level object-based filesystem that is 1/25th
the size of XFS. Although OBFS achieves greater write 
throughput than XFS, its read throughput (Haystack’s 
main concern) is slightly worse. 





Managing metadata Weil et al. [26, 27] address 
scaling metadata management in Ceph, a petabyte-scale 
object store. Ceph further decouples the mapping from 
logical units to physical ones by introducing generating 
functions instead of explicit mappings. Clients can cal- 
culate the appropriate metadata rather than look it up. 
Implementing this technique in Haystack remains future 
work. Hendricks et al. [13] observe that traditional
metadata pre-fetching algorithms are less effective for 
object stores because related objects, which are identi- 
fied by a unique number, lack the semantic groupings 
that directories implicitly impose. Their solution is to 
embed inter-object relationships into the object id. This 
idea is orthogonal to Haystack as Facebook explicitly 
stores these semantic relationships as part of the social 
graph. In Spyglass [15], Leung et al. propose a design
for quickly and scalably searching through metadata 
of large-scale storage systems. Manber and Wu also 
propose a way to search through entire filesystems in 
GLIMPSE [17]. Patil et al. [20] use a sophisticated 
algorithm in GIGA+ to manage the metadata associated 
with billions of files per directory. We engineered a 
simpler solution than many existing works as Haystack 
does not have to provide search features nor traditional 
UNIX filesystem semantics. 


Distributed filesystems Haystack’s notion of a logi- 
cal volume is similar to Lee and Thekkath’s [14] vir- 
tual disks in Petal. The Boxwood project [16] explores 
using high-level data structures as the foundation for 
storage. While compelling for more complicated al- 
gorithms, abstractions like B-trees may not have high 
impact on Haystack’s intentionally lean interface and 
semantics. Similarly, Sinfonia’s [1] mini-transactions 
and PNUTS’s [5] database functionality provide more 
features and stronger guarantees than Haystack needs. 
Ghemawat et al. [9] designed the Google File System 
for a workload consisting mostly of append operations 
and large sequential reads. Bigtable [4] provides a stor- 
age system for structured data and offers database-like 
features for many of Google’s projects. It is unclear 
whether many of these features make sense in a system 
optimized for photo storage. 


6 Conclusion 


This paper describes Haystack, an object storage sys- 
tem designed for Facebook’s Photos application. We de- 
signed Haystack to serve the long tail of requests seen 
by sharing photos in a large social network. The key 
insight is to avoid disk operations when accessing meta- 
data. Haystack provides a fault-tolerant and simple solu- 
tion to photo storage at dramatically less cost and higher 
throughput than a traditional approach using NAS appli- 
ances. Furthermore, Haystack is incrementally scalable, 
a necessary quality as our users upload hundreds of mil- 
lions of photos each week. 


Abstract


Highly available cloud storage is often implemented with 
complex, multi-tiered distributed systems built on top 
of clusters of commodity servers and disk drives. So- 
phisticated management, load balancing and recovery 
techniques are needed to achieve high performance and 
availability amidst an abundance of failure sources that 
include software, hardware, network connectivity, and 
power issues. While there is a relative wealth of fail- 
ure studies of individual components of storage systems, 
such as disk drives, relatively little has been reported so 
far on the overall availability behavior of large cloud- 
based storage services. 

We characterize the availability properties of cloud 
storage systems based on an extensive one year study of 
Google’s main storage infrastructure and present statis- 
tical models that enable further insight into the impact 
of multiple design choices, such as data placement and 
replication strategies. With these models we compare 
data availability under a variety of system parameters 
given the real patterns of failures observed in our fleet. 


1 Introduction 


Cloud storage is often implemented by complex multi- 
tiered distributed systems on clusters of thousands of 
commodity servers. For example, in Google we run 
Bigtable [9], on GFS [16], on local Linux file systems 
that ultimately write to local hard drives. Failures in any 
of these layers can cause data unavailability. 

Correctly designing and optimizing these multi- 
layered systems for user goals such as data availability 
relies on accurate models of system behavior and perfor- 
mance. In the case of distributed storage systems, this 
includes quantifying the impact of failures and prioritiz- 
ing hardware and software subsystem improvements in 
the datacenter environment.

We present models we derived from studying a year of 
live operation at Google and describe how our analysis 
influenced the design of our next generation distributed 
storage system [22]. 

Our work is presented in two parts. First, we measured 
and analyzed the component availability, e.g. machines, 
racks, multi-racks, in tens of Google storage clusters. In 
this part we: 


• Compare mean time to failure for system components
at different granularities, including disks, machines
and racks of machines. (Section 3)


• Classify the failure causes for storage nodes, their
characteristics and contribution to overall unavailability.
(Section 3)


• Apply a clustering heuristic for grouping failures
which occur almost simultaneously and show that
a large fraction of failures happen in bursts. (Section 4)


• Quantify how likely a failure burst is associated
with a given failure domain. We find that most large
bursts of failures are associated with rack- or multi-rack
level events. (Section 4)


Based on these results, we determined that the criti- 
cal element in models of availability is their ability to 
account for the frequency and magnitude of correlated 
failures. 

Next, we consider data availability by analyzing un- 
availability at the distributed file system level, where one 
file system instance is referred to as a cell. We apply two 
models of multi-scale correlated failures for a variety of 
replication schemes and system parameters. In this part 
we: 


• Demonstrate the importance of modeling correlated
failures when predicting availability, and show their
impact under a variety of replication schemes and
placement policies. (Sections 5 and 6)


• Formulate a Markov model for data availability, that
can scale to arbitrary cell sizes, and captures the interaction
of failures with replication policies and recovery
times. (Section 7)


• Introduce multi-cell replication schemes and compare
the availability and bandwidth trade-offs
against single-cell schemes. (Sections 7 and 8)


• Show the impact of hardware failure on our cells is
significantly smaller than the impact of effectively
tuning recovery and replication parameters. (Section 8)


Our results show the importance of considering 
cluster-wide failure events in the choice of replication 
and recovery policies. 


2 Background 


We study end-to-end data availability in a cloud computing
storage environment. These environments often
use loosely coupled distributed storage systems such as 
GFS [1, 16] due to the parallel I/O and cost advantages 
they provide over traditional SAN and NAS solutions. A 
few relevant characteristics of such systems are: 


• Storage server programs running on physical machines
in a datacenter, managing local disk storage
on behalf of the distributed storage cluster. We refer
to the storage server programs as storage nodes or
nodes.

• A pool of storage service masters managing data
placement, load balancing and recovery, and monitoring
of storage nodes.

• A replication or erasure code mechanism for user
data to provide resilience to individual component
failures.


A large collection of nodes along with their higher 
level coordination processes [17] are called a cell or 
storage cell. These systems usually operate in a shared 
pool of machines running a wide variety of applications. 
A typical cell may comprise many thousands of nodes 
housed together in a single building or set of colocated 
buildings. 


2.1 Availability


A storage node becomes unavailable when it fails to respond
positively to periodic health checking pings sent
by our monitoring system. The node remains unavailable
until it regains responsiveness or the storage system
reconstructs the data from other surviving nodes.

Figure 1: Cumulative distribution function of the duration of
node unavailability periods.

Nodes can become unavailable for a large number of 
reasons. For example, a storage node or networking 
switch can be overloaded; a node binary or operating 
system may crash or restart; a machine may experience 
a hardware error; automated repair processes may tem- 
porarily remove disks or machines; or the whole clus- 
ter could be brought down for maintenance. The vast 
majority of such unavailability events are transient and 
do not result in permanent data loss. Figure 1 plots the
CDF of node unavailability duration, showing that less 
than 10% of events last longer than 15 minutes. This 
data is gathered from tens of Google storage cells, each 
with 1000 to 7000 nodes, over a one year period. The 
cells are located in different datacenters and geographi- 
cal regions, and have been used continuously by different 
projects within Google. We use this dataset throughout 
the paper, unless otherwise specified. 

Experience shows that while short unavailability 
events are most frequent, they tend to have a minor im- 
pact on cluster-level availability and data loss. This is 
because our distributed storage systems typically add 
enough redundancy to allow data to be served from other 
sources when a particular node is unavailable. Longer 
unavailability events, on the other hand, make it more 
likely that faults will overlap in such a way that data 
could become unavailable at the cluster level for long 
periods of time. Therefore, while we track unavailabil- 
ity metrics at multiple time scales in our system, in this 
paper we focus only on events that are 15 minutes or 
longer. This interval is long enough to exclude the majority
of benign transient events while not so long as to exclude
significant cluster-wide phenomena. As in [11], we
observe that initiating recovery after transient failures is 
inefficient and reduces resources available for other op- 
erations. For these reasons, GFS typically waits 15 min- 
utes before commencing recovery of data on unavailable 
nodes. 
We primarily use two metrics throughout this paper.
The average availability of all N nodes in a cell is defined
as:

\[
A_N = \frac{\sum_{N_i \in N} \mathrm{uptime}(N_i)}{\sum_{N_i \in N} \bigl(\mathrm{uptime}(N_i) + \mathrm{downtime}(N_i)\bigr)} \tag{1}
\]

We use uptime(N_i) and downtime(N_i) to refer to the
lengths of time a node N_i is available or unavailable, respectively.
The sum of availability periods over all nodes
is called node uptime. We define uptime similarly for 
other component types. We define unavailability as the 
complement of availability. 

Mean time to failure, or MTTF, is commonly quoted 
in the literature related to the measurements of availabil- 
ity. We use MTTF for components that suffer transient 
or permanent failures, to avoid frequent switches in ter- 
minology. 


\[
\mathrm{MTTF} = \frac{\text{uptime}}{\text{number of failures}} \tag{2}
\]

Availability measurements for nodes and individual
components in our system are presented in Section 3.
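
As a concrete illustration of the two metrics, the following Python sketch computes average availability and MTTF from per-node uptime and downtime totals. The function names and the sample numbers are invented for the example and are not taken from the measured fleet.

```python
from typing import Iterable, Tuple


def average_availability(nodes: Iterable[Tuple[float, float]]) -> float:
    """Average availability over nodes given (uptime, downtime) pairs.

    Total uptime is divided by total uptime plus downtime, so nodes
    observed for longer carry proportionally more weight.
    """
    pairs = list(nodes)
    total_uptime = sum(up for up, _ in pairs)
    total_time = sum(up + down for up, down in pairs)
    return total_uptime / total_time


def mttf(total_uptime: float, number_of_failures: int) -> float:
    """Mean time to failure: total uptime divided by the failure count."""
    return total_uptime / number_of_failures


# Hypothetical data: two nodes, times in hours.
sample = [(99.0, 1.0), (95.0, 5.0)]
print(average_availability(sample))  # 0.97
print(mttf(194.0, 4))                # 48.5
```

Unavailability, the complement defined in the text, is then simply one minus the first value.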


2.2 Data replication 


Distributed storage systems increase resilience to fail- 
ures by using replication [2] or erasure encoding across 
nodes [28]. In both cases, data is divided into a set of 
stripes, each of which comprises a set of fixed size data 
and code blocks called chunks. Data in a stripe can be re- 
constructed from some subsets of the chunks. For repli- 
cation, R = n refers to n identical chunks in a stripe, 
so the data may be recovered from any one chunk. For 
Reed-Solomon erasure encoding, RS(n,m) denotes n 
distinct data blocks and m error correcting blocks in each 
stripe. In this case a stripe may be reconstructed from any 
n chunks. 

We call a chunk available if the node it is stored on 
is available. We call a stripe available if enough of its 
chunks are available to reconstruct the missing chunks, 
if any. 
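
The stripe availability rule can be written down compactly. The sketch below is an illustrative Python encoding of the definitions above. The `Scheme` class and helper names are hypothetical, and chunk placement is reduced to a list of booleans saying whether each chunk's node is currently available.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Scheme:
    total_chunks: int    # chunks stored per stripe
    chunks_needed: int   # minimum available chunks to reconstruct the stripe


def replication(n: int) -> Scheme:
    """R = n: n identical chunks, any single chunk suffices."""
    return Scheme(total_chunks=n, chunks_needed=1)


def reed_solomon(n: int, m: int) -> Scheme:
    """RS(n, m): n data blocks plus m code blocks, any n chunks suffice."""
    return Scheme(total_chunks=n + m, chunks_needed=n)


def stripe_available(scheme: Scheme, chunk_node_up: Sequence[bool]) -> bool:
    """A stripe is available if enough of its chunks sit on available nodes."""
    if len(chunk_node_up) != scheme.total_chunks:
        raise ValueError("expected one availability flag per chunk")
    return sum(chunk_node_up) >= scheme.chunks_needed


# R = 3 survives two unavailable nodes; RS(6, 3) survives three but not four.
print(stripe_available(replication(3), [False, False, True]))           # True
print(stripe_available(reed_solomon(6, 3), [True] * 6 + [False] * 3))   # True
print(stripe_available(reed_solomon(6, 3), [True] * 5 + [False] * 4))   # False
```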

Data availability is a complex function of the individ- 
ual node availability, the encoding scheme used, the dis- 
tribution of correlated node failures, chunk placement, 
and recovery times that we will explore in the second part 
of this paper. We do not explore related mechanisms for 
dealing with failures, such as additional application level 
redundancy and recovery, and manual component repair. 


3 Characterizing Node Availability 


Anything that renders a storage node unresponsive is
a potential cause of unavailability, including hardware
component failures, software bugs, crashes, system
reboots, power loss events, and loss of network
connectivity. We include in our analysis the impact of software
upgrades, reconfiguration, and other maintenance. These 
planned outages are necessary in a fast evolving datacen- 
ter environment, but have often been overlooked in other 
availability studies. In this section we present data for 
storage node unavailability and provide some insight into 
the main causes for unavailability. 


3.1 Numbers from the fleet 


Failure patterns vary dramatically across different hard- 
ware platforms, datacenter operating environments, and 
workloads. We start by presenting numbers for disks. 

Disks have been the focus of several other studies, 
since they are the system component that permanently 
stores the data, and thus a disk failure potentially results 
in permanent data loss. The numbers we observe for disk 
and storage subsystem failures, presented in Table 2, are 
comparable with what other researchers have measured. 
One study [29] reports ARR (annual replacement rate) 
for disks between 2% and 4%. Another study [19] fo- 
cused on storage subsystems, thus including errors from 
shelves, enclosures, physical interconnects, protocol fail- 
ures, and performance failures. They found AFR (annual 
failure rate) generally between 2% and 4%, but for some 
storage systems values ranging between 3.9% and 8.3%. 

For the purposes of this paper, we are interested in 
disk errors as perceived by the application layer. This 
includes latent sector errors and corrupt sectors on disks, 
as well as errors caused by firmware, device drivers, con- 
trollers, cables, enclosures, silent network and memory 
corruption, and software bugs. We deal with these er- 
rors with background scrubbing processes on each node, 
as in [5, 31], and by verifying data integrity during client 
reads [4]. Background scrubbing in GFS finds that between
1 in $10^6$ and 1 in $10^7$ of older data blocks do not match
the checksums recorded when the data was originally
written. However, these cell-wide rates are typically
concentrated on a small number of disks.

We are also concerned with node failures in addition 
to individual disk failures. Figure 2 shows the distribu- 
tion of three mutually exclusive causes of node unavail- 
ability in one of our storage cells. We focus on node 
restarts (software restarts of the storage program running 
on each machine), planned machine reboots (e.g. ker- 
nel version upgrades), and unplanned machine reboots 
(e.g. kernel crashes). For the purposes of this figure we 
do not exclude events that last less than 15 minutes, but 
we still end the unavailability period when the system 
reconstructs all the data previously stored on that node. 
Node restart events exhibit the greatest variability in
duration, ranging from less than one minute to well over an
hour, though they usually have the shortest duration.
Unplanned reboots have the longest average duration since
extra checks or corrective action are often required to
restore machines to a safe state.

Figure 2: Cumulative distribution function of node
unavailability durations by cause.

Figure 3: Rate of events per 1000 nodes per day, for one
example cell.


Figure 3 plots the unavailability events per 1000 nodes 
per day for one example cell, over a period of three 
months. The number of events per day, as well as the
number of events that can be attributed to a given cause,
varies significantly over time as operational processes,
tools, and workloads evolve. Events we cannot classify
accurately are labeled unknown. 

The effect of machine failures on availability is de- 
pendent on the rate of failures, as well as on how long 
the machines stay unavailable. Figure 4 shows the node 
unavailability, along with the causes that generated the 
unavailability, for the same cell used in Figure 3. The 
availability is computed with a one week rolling window, 
using definition (1). We observe that the majority of un- 
availability is generated by planned reboots. 

Figure 4: Storage node unavailability computed with a one
week rolling window, for one example cell.





Cause                        Unavailability (%)
                             average / min / max
Node restarts                0.0139 / 0.0004 / 0.1295
Planned machine reboots      0.0154 / 0.0050 / 0.0563
Unplanned machine reboots    0.0025 / 0.0000 / 0.0122
Unknown                      0.0142 / 0.0013 / 0.0454

Table 1: Unavailability attributed to different failure causes,
over the full set of cells.


Table 1 shows the unavailability from node restarts, 
planned and unplanned machine reboots, each of which 
is a significant cause. The numbers are exclusive, thus 
the planned machine reboots do not include node restarts. 

Table 2 shows the MTTF for a series of important
components: disks, nodes, and racks of nodes. The numbers
we report for component failures are inclusive of
software errors and hardware failures. Though disk
failures are permanent and most node failures are transitory,
the significantly greater frequency of node failures makes
them a much more important factor for system
availability (Section 8.4).


4 Correlated Failures 


The co-occurring failure of a large number of nodes
can reduce the effectiveness of replication and encoding
schemes. Therefore it is critical to take into account the
statistical behavior of correlated failures to understand
data availability. In this section we are more concerned
with measuring the frequency and severity of such
failures than with their root causes.

Table 2: Component failures across several Google cells.

Figure 5: Seven node failures clustered into two failure bursts
when the window size is 2 minutes. Note how only the
unavailability start times matter.


We define a failure burst and examine features of these 
bursts in the field. We also develop a method for identi- 
fying which bursts are likely due to a failure domain. By 
failure domain, we mean a set of machines which we ex- 
pect to simultaneously suffer from a common source of 
failure, such as machines which share a network switch 
or power cable. We demonstrate this method by validat- 
ing physical racks as an important failure domain. 


4.1 Defining failure bursts 


We define a failure burst with respect to a window size 
w as a maximal sequence of node failures, each one oc- 
curring within a time window w of the next. Figure 5 
illustrates the definition. We choose w = 120 s for several
reasons. First, it is longer than the interval at which
nodes in our system are periodically polled for
their status. A window length smaller than the polling
interval would not make sense, as some pairs of events
which actually occur within the window length of each
other would not be correctly associated. Second, it is less
than a tenth of the average time it takes our system to re- 
cover a chunk, thus, failures within this window can be 
considered as nearly concurrent. Figure 6 shows the frac- 
tion of individual failures that get clustered into bursts of 
at least 10 nodes as the window size changes. Note that 
the graph is relatively flat after 120 s, which is our third 
reason for choosing this value. 
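
The burst definition can be sketched in a few lines. This is a minimal illustration, assuming failures are given as unavailability start times in seconds and that "within a window w" means a gap of at most w; the function name and the sample times are hypothetical.

```python
def cluster_failure_bursts(start_times, window=120.0):
    """Group failure start times into maximal bursts.

    Consecutive failures belong to the same burst when the gap
    between them is at most `window` seconds (assumed inclusive).
    Only the unavailability start times are used.
    """
    bursts = []
    for t in sorted(start_times):
        if bursts and t - bursts[-1][-1] <= window:
            bursts[-1].append(t)   # extends the current burst
        else:
            bursts.append([t])     # starts a new burst
    return bursts


# Seven hypothetical failures; the 300 s gap splits them into two bursts.
example_times = [0, 60, 150, 200, 500, 580, 690]
example_bursts = cluster_failure_bursts(example_times, window=120.0)
example_sizes = [len(b) for b in example_bursts]
```

Because each failure is compared only with its immediate predecessor, a burst may last much longer than w as long as no gap inside it exceeds w.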

Since failures are clustered into bursts based on their
times of occurrence alone, there is a risk that two bursts
with independent causes will be clustered into a single
burst by chance. The slow increase in Figure 6 past 120 s
illustrates this phenomenon. The error incurred is small
as long as we keep the window size small. Given a window
size of 120 s and the set of bursts obtained from it,
the probability that a random failure gets included in a
burst (as opposed to becoming its own singleton burst)
is 8.0%. When this inclusion happens, most of the time
the random failure is combined with a singleton burst to
form a burst of two nodes. The probability that a random
failure gets included in a burst of at least 10 nodes is only
0.068%. For large bursts, which contribute most
unavailability as we will see in Section 5.2, the fraction of nodes
affected is the significant quantity and changes
insignificantly if a burst of size one or two nodes is accidentally
clustered with it.

Figure 6: Effect of the window size on the fraction of individual
failures that get clustered into bursts of at least 10 nodes.

Using this definition, we observe that 37% of failures 
are part of a burst of at least 2 nodes. Given the result 
above that only 8.0% of non-correlated failures may be 
incorrectly clustered, we are confident that close to 37% 
of failures are truly correlated. 


4.2 Views of failure bursts 


Figure 7 shows the accumulation of individual failures in 
bursts. For clarity we show all bursts of size at least 10 
seen over a 60 day period in an example cell. In the plot, 
each burst is displayed with a separate shape. The n-th
node failure that joins a burst at time $t_n$ is said to have
ordinal n − 1 and is plotted at point $(t_n, n - 1)$. Two
broad classes of failure bursts can be seen in the plot:


1. Those failure bursts that are characterized by a large 
number of failures in quick succession show up as 
steep lines with a large number of nodes in the burst. 
Such failures can be seen, for example, following a 
power outage in a datacenter. 


2. Those failure bursts that are characterized by a 
smaller number of nodes failing at a slower rate 
at evenly spaced intervals. Such correlated failures 
can be seen, for example, as part of rolling reboot 
or upgrade activity at the datacenter management 
layer. 


Figure 8 displays the bursts sorted by the number of
nodes and racks that they affect. The size of each bubble
indicates the frequency of each burst group. The grouping
of points along the 45° line represents bursts where
as many racks are affected as nodes. The points furthest
away from this line represent the most rack-correlated
failure bursts. For larger bursts of at least 10 nodes, we
find only 3% have all their nodes on unique racks. We
introduce a metric to quantify this degree of domain
correlation in the next section.

Figure 7: Development of failure bursts in one example cell.


4.3 Identifying domain-related failures 


Domain-related issues, such as those associated with
physical racks, network switches and power domains, are
frequent causes of correlated failure. These problems can
sometimes be difficult to detect directly. We introduce 
a metric to measure the likelihood that a failure burst is 
domain-related, rather than random, based on the pat- 
tern of failure observed. The metric can be used as an 
effective tool for identifying causes of failures that are 
connected to domain locality. It can also be used to eval- 
uate the importance of domain diversity in cell design 
and data placement. We focus on detecting rack-related 
node failures in this section, but our methodology can be 
applied generally to any domain and any type of failure. 


Let a failure burst be encoded as an n-tuple
$(k_1, k_2, \ldots, k_n)$, where $k_1 \le k_2 \le \ldots \le k_n$. Each
$k_i$ gives the number of nodes affected in the i-th rack
affected, where racks are ordered so that these values are
increasing. This rack-based encoding captures all relevant
information about the rack locality of the burst. Let
the size of the burst be the number of nodes that are
affected, i.e., $\sum_{i=1}^{n} k_i$. We define the rack-affinity score of
a burst to be

$$\sum_{i=1}^{n} \frac{k_i (k_i - 1)}{2}$$

Note that this is the number of ways of choosing two
nodes from the burst within the same rack. The score
allows us to compare the rack concentration of bursts of
the same size. For example the burst (1, 4) has score 6.
The burst (1, 1, 1, 2) has score 1, which is lower. Therefore,
the first burst is more concentrated by rack. Possible
alternatives for the score include the sum of squares
$\sum_{i=1}^{n} k_i^2$ or the negative entropy $\sum_{i=1}^{n} k_i \log(k_i)$. The
sum of squares formula is equivalent to our chosen score
because for a fixed burst size, the two formulas are
related by an affine transform. We believe the
entropy-inspired formula to be inferior because its log factor
tends to downplay the effect of a very large $k_i$. Its
real-valued score is also a problem for the dynamic program
we use later in computation.

Figure 8: Frequency of failure bursts sorted by racks and nodes
affected.
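
The score can be checked against the two examples above with a short sketch. It takes the score to be the number of same-rack node pairs, as described in the text; the function names and the rack labels are hypothetical.

```python
from collections import Counter


def rack_encoding(failed_node_racks):
    """Encode a burst as the sorted tuple of per-rack failure counts.

    `failed_node_racks` lists the rack identifier of every node
    that failed in the burst.
    """
    return tuple(sorted(Counter(failed_node_racks).values()))


def rack_affinity_score(encoding):
    """Number of ways to choose two burst nodes within the same rack."""
    return sum(k * (k - 1) // 2 for k in encoding)


def sum_of_squares(encoding):
    """Alternative score mentioned in the text."""
    return sum(k * k for k in encoding)


# Five failures: one node in rack "a" and four in rack "b".
example_encoding = rack_encoding(["a", "b", "b", "b", "b"])
example_score = rack_affinity_score(example_encoding)
```

For a fixed burst size the sum of squares equals twice the score plus the burst size, which is the affine relation mentioned above.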

We define the rack affinity of a burst in a particular cell 
to be the probability that a burst of the same size affecting 
randomly chosen nodes in that cell will have a smaller 
burst score, plus half the probability that the two scores 
are equal, to eliminate bias. Rack affinity is therefore a 
number between 0 and 1 and can be interpreted as a ver- 
tical position on the cumulative distribution of the scores 
of random bursts of the same size. It can be shown that 
for a random burst, the expected value of its rack affin- 
ity is exactly 0.5. So we define a rack-correlated burst 
to be one with a metric close to 1, a rack-uncorrelated 
burst to be one with a metric close to 0.5, and a rack- 
anti-correlated burst to be one with a metric close to 0 
(we have not observed such a burst). It is possible to
approximate the metric using simulation of random bursts.
We choose to compute the metric exactly using dynamic 
programming because the extra precision it provides al- 
lows us to distinguish metric values very close to 1. 
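
The simulation-based approximation mentioned above can be sketched as follows. This is not the exact dynamic program used in this paper; it is a Monte Carlo estimate under the assumption that a random burst picks nodes uniformly without replacement. The function names, the cell layout, and the trial count are hypothetical.

```python
import random
from collections import Counter


def burst_score(racks_of_failed_nodes):
    """Number of same-rack node pairs in a burst."""
    counts = Counter(racks_of_failed_nodes).values()
    return sum(k * (k - 1) // 2 for k in counts)


def estimate_rack_affinity(burst_racks, cell_node_racks, trials=5000, seed=0):
    """Estimate P(random score < observed) + 0.5 * P(random score == observed).

    `burst_racks` holds the rack of each failed node in the burst;
    `cell_node_racks` holds the rack of every node in the cell.
    """
    rng = random.Random(seed)
    observed = burst_score(burst_racks)
    size = len(burst_racks)
    smaller = equal = 0
    for _ in range(trials):
        sampled = burst_score(rng.sample(cell_node_racks, size))
        if sampled < observed:
            smaller += 1
        elif sampled == observed:
            equal += 1
    return (smaller + 0.5 * equal) / trials


# Hypothetical cell: 20 racks of 10 nodes each.
cell = [rack for rack in range(20) for _ in range(10)]
one_rack_burst = [0] * 10          # ten failures in a single rack
spread_burst = list(range(10))     # ten failures on ten distinct racks
affinity_one_rack = estimate_rack_affinity(one_rack_burst, cell)
affinity_spread = estimate_rack_affinity(spread_burst, cell)
```

A sampled estimate cannot separate values that differ by less than roughly one over the number of trials, which is why an exact computation is preferable for values very close to 1.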

We find that, in general, larger failure bursts have 
higher rack affinity. All our failure bursts of more than 
20 nodes have rack affinity greater than 0.7, and those 
of more than 40 nodes have affinity at least 0.9. It is 
worth noting that some bursts with high rack affinity do 
not affect an entire rack and are not caused by common 
network or power issues. This could be the case for a 
bad batch of components or new storage node binary or 
kernel, whose installation is only slightly correlated with 
these domains. 


5 Coping with Failure 


We now begin the second part of the paper where we 
transition from node failures to analyzing replicated data 
availability. Two methods for coping with the large num- 
ber of failures described in the first part of this paper 
include data replication and recovery, and chunk place- 
ment. 


5.1 Data replication and recovery 


Replication or erasure encoding schemes provide re- 
silience to individual node failures. When a node fail- 
ure causes the unavailability of a chunk within a stripe, 
we initiate a recovery operation for that chunk from the 
other available chunks remaining in the stripe. 

Distributed filesystems will necessarily employ 
queues for recovery operations following node failure. 
These queues prioritize reconstruction of stripes which 
have lost the most chunks. The rate at which missing 
chunks may be recovered is limited by the bandwidth of 
individual disks, nodes, and racks. Furthermore, there 
is an explicit design tradeoff in the use of bandwidth 
for recovery operations versus serving client read/write 
requests. 

Figure 9: Example chunk recovery after failure bursts.


This limit is particularly apparent during correlated 
failures when a large number of chunks go missing at the 
same time. Figure 9 shows the recovery delay after a fail- 
ure burst of 20 storage nodes affecting millions of stripes. 
Operators may adjust the rate-limiting seen in the figure.
The models presented in the following sections allow us
to measure the sensitivity of data availability to this rate
limit and other parameters, described in Section 8.

Figure 10: Stripe MTTF due to different burst sizes. Burst sizes
are defined as a fraction of all nodes: small (0-0.001), medium
(0.001-0.01), large (0.01-0.1). For each size, the left column
represents uniform random placement, and the right column
represents rack-aware placement.


5.2 Chunk placement and stripe unavailability 


To mitigate the effect of large failure bursts in a single 
failure domain we consider known failure domains when 
placing chunks within a stripe on storage nodes. For ex- 
ample, racks constitute a significant failure domain to 
avoid. A rack-aware policy is one that ensures that no 
two chunks in a stripe are placed on nodes in the same 
rack. 

Given a failure burst, we can compute the expected 
fraction of stripes made unavailable by the burst. More 
generally, we compute the probability that exactly k 
chunks are affected in a stripe of size n, which is es- 
sential to the Markov model of Section 7. Assuming that 
stripes are uniformly distributed across nodes of the cell, 
this probability is a ratio where the numerator is the num- 
ber of ways to place a stripe of size n in the cell such 
that exactly k of its chunks are affected by the burst, and 
the denominator is the total number of ways to place a 
stripe of size n in the cell. These numbers can be com- 
puted combinatorially. The same ratio can be used when 
chunks are constrained by a placement policy, in which 
case the numerator and denominator are computed using 
dynamic programming. 

Figure 10 shows the stripe MTTF for three classes of 
burst size. For each class of bursts we calculate the av- 
erage fraction of stripes affected per burst and the rate 
of bursts, to get the combined MTTF due to that class. 
We see that for all encodings except R = 1, large failure
bursts are the biggest contributor to unavailability
despite the fact that they are much rarer. We also see
that for small and medium burst sizes, and large encodings,
using a rack-aware placement policy typically increases the
stripe MTTF by a factor of 3. This is a significant
gain considering that in uniform random placement,
most stripes end up with their chunks on different racks
due to chance.


6 Cell Simulation 


This section introduces a trace-based simulation method 
for calculating availability in a cell. The method replays 
observed or synthetic sequences of node failures and
calculates the resulting impact on stripe availability. It
offers a detailed view of availability in short time frames.


For each node, the recorded events of interest are 
down, up and recovery complete events. When all nodes 
are up, they are each assumed to be responsible for an 
equal number of chunks. When a node goes down it 
is still responsible for the same number of chunks until 
15 minutes later when the chunk recovery process starts. 
For simplicity and conservativeness, we assume that all 
these chunks remain unavailable until the recovery com- 
plete event. A more accurate model could model recov- 
ery too, such as by reducing the number of unavailable 
chunks linearly until the recovery complete event, or by 
explicitly modelling recovery queues. 


We are interested in the expected number of stripes 
that are unavailable for at least 15 minutes, as a function 
of time. Instead of simulating a large number of stripes, 
it is more efficient to simulate all possible stripes, and use 
combinatorial calculations to obtain the expected number 
of unavailable stripes given a set of down nodes, as was 
done in Section 5.2. 
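
A heavily simplified version of this replay can be sketched as follows. It assumes uniform chunk placement, represents each outage as a hypothetical (down time, recovery complete time) pair in seconds, and counts a node only once it has been down for 15 minutes; none of these names or numbers come from our traces.

```python
from math import comb

GRACE_SECONDS = 15 * 60  # delay before an outage is counted


def long_down_nodes(t, outages):
    """Nodes down for at least 15 minutes at time t and not yet recovered."""
    return sum(1 for down, recovered in outages
               if down + GRACE_SECONDS <= t < recovered)


def expected_unavailable_fraction(cell_nodes, down_nodes, stripe_size, needed):
    """Expected fraction of all possible stripes with too few chunks left."""
    total = comb(cell_nodes, stripe_size)
    bad = sum(comb(down_nodes, k) * comb(cell_nodes - down_nodes,
                                         stripe_size - k)
              for k in range(stripe_size - needed + 1, stripe_size + 1))
    return bad / total


def replay(times, outages, cell_nodes, stripe_size, needed):
    """Unavailable stripe fraction at each requested time."""
    return [expected_unavailable_fraction(
                cell_nodes, long_down_nodes(t, outages), stripe_size, needed)
            for t in times]


# Hypothetical trace: three overlapping outages in a cell of 100 nodes, R = 3.
trace = [(0, 3600), (100, 4000), (200, 5000)]
series = replay([1000, 1200, 4500], trace, cell_nodes=100,
                stripe_size=3, needed=1)
```

As in the text, all chunks on a counted node are treated as unavailable until its recovery complete event, which is conservative.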


As a validation, we can run the simulation using the 
stripe encodings that were in use at the time to see if the 
predicted number of unavailable stripes matches the ac- 
tual number of unavailable stripes as measured by our 
storage system. Figure 11 shows the result of such a 
simulation. The prediction is a linear combination of the 
predictions for individual encodings present, in this case 
mostly RS(5,3) and R = 3. 


Analysis of hypothetical scenarios may also be made 
with the cell simulator, such as the effect of encoding 
choice and of chunk recovery rate. Although we may 
not change the frequency and severity of bursts in an ob- 
served sequence, bootstrap methods [13] may be used 
to generate synthetic failure traces with different burst 
characteristics. This is useful for exploring sensitivity to 
these events and the impact of improvements in datacen- 
ter reliability. 


9th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’10) 


Figure 11: Unavailability prediction over time for a particular
cell for a day with large failure bursts. (Plot: fraction of
unavailable stripes, on a logarithmic axis, against time of
day; series: Measured and Predicted.)


7 Markov Model of Stripe Availability 


In this section, we formulate a Markov model of data 
availability. The model captures the interaction of dif- 
ferent failure types and production parameters with more 
flexibility than is possible with the trace-based simula- 
tion described in the previous section. Although the 
model makes assumptions beyond those in the trace- 
based simulation method, it has certain advantages. First, 
it allows us to model and understand the impact of 
changes in hardware and software on end-user data avail- 
ability. There are typically too many permutations of sys- 
tem changes and encodings to test each in a live cell. The 
Markov model allows us to reason directly about the con- 
tribution to data availability of each level of the storage 
stack and several system parameters, so that we can eval- 
uate tradeoffs. Second, the systems we study may have 
unavailability rates that are so low they are difficult to 
measure directly. The Markov model handles rare events 
and arbitrarily low stripe unavailability rates efficiently. 

The model focuses on the availability of a representa- 
tive stripe. Let s be the total number of chunks in the 
stripe, and r be the minimum number of chunks needed 
to recover that stripe. As described in Section 2.2, r = 1 
for replicated data and r = n for RS(n,m) encoded 
data. The state of a stripe is represented by the number of
available chunks. Thus, the states are s, s − 1, ..., r, r − 1,
with the state r − 1 representing all of the unavailable
states where the stripe has fewer than the required r chunks
available. Figure 12 shows a Markov chain corresponding
to an R = 2 stripe.

The Markov chain transitions are specified by the rates
at which a stripe moves from one state to another, due to
chunk failures and recoveries. Chunk failures reduce the
number of available chunks, and several chunks may fail
‘simultaneously’ in a failure burst event. Balancing this,
recoveries increase the number of available chunks if any
are unavailable.

Figure 12: The Markov chain for a stripe encoded using R = 2.

A key assumption of the Markov model is that events 
occur independently and with constant rates over time. 
This independence assumption, although strong, is not 
the same as the assumption that individual chunks fail 
independently of each other. Rather, it implies that fail- 
ure events are independent of each other, but each event 
may involve multiple chunks. This allows a richer and 
more flexible view of the system. It also implies that re- 
covery rates for a stripe depend only on its own current 
state. 

In practice, failure events are not always independent. 
Most notably, it has been pointed out in [29] that the time 
between disk failures is not exponentially distributed and 
exhibits autocorrelation and long-range dependence. The 
Weibull distribution provides a much better fit for disk 
MTTF. 

However, the exponential distribution is a reason- 
able approximation for the following reasons. First, the 
Weibull distribution is a generalization of the exponen- 
tial distribution that allows the rate parameter to increase 
over time to reflect the aging of disks. In a large pop- 
ulation of disks, the mixture of disks of different ages 
tends to be stable, and so the average failure rate in a 
cell tends to be constant. When the failure rate is stable, 
the Weibull distribution provides the same quality of fit 
as the exponential. Second, disk failures make up only 
a small subset of failures that we examined, and model 
results indicate that overall availability is not particularly 
sensitive to them. Finally, other authors ([24]) have con- 
cluded that correlation and non-homogeneity of the re- 
covery rate and the mean time to a failure event have 
a much smaller impact on system-wide availability than 
the size of the event. 


7.1 Construction of the Markov chain 


We compute the transition rate due to failures using
observed failure events. Let $\lambda$ denote the rate of failure
events affecting chunks, including node and disk failures.
For any observed failure event we compute the probability
that it affects k chunks out of the i available chunks in
a stripe. As in Section 6, for failure bursts this computation
takes into account the stripe placement strategy. The
rate and severity of bursts, node, disk, and other failures
may be adjusted here to suit the system parameters under
exploration.

Averaging these probabilities over all failure events
gives the probability, $p_{i,j}$, that a random failure event will
affect i − j out of i available chunks in a stripe. This gives
a rate of transition from state i to state j < i of
$\lambda_{i,j} = \lambda p_{i,j}$ for $s \ge i > j \ge r$, and
$\lambda_{i,r-1} = \lambda \sum_{j=0}^{r-1} p_{i,j}$
for the rate of reaching the unavailable state. Note that
transitions from a state to itself are ignored.

For chunk recoveries, we assume a fixed rate of $\rho$ for
recovering a single chunk, i.e. moving from a state i to
i + 1, where $r \le i < s$. In particular, this means we
assume that the recovery rate does not depend on the total
number of unavailable chunks in the cell. This is justified
by setting $\rho$ to a lower bound for the rate of recovery,
based on observed recovery rates across our storage cells
or proposed system performance parameters. While parallel
recovery of multiple chunks from a stripe is possible,
$\rho_{i,i+1} = (s - i)\rho$, we model serial recovery to gain
more conservative estimates of stripe availability.

As with [12], the distributed systems we study use pri- 
oritized recovery for stripes with more than one chunk 
unavailable. Our Markov model allows state-dependent 
recovery that captures this prioritization, but for ease of 
exposition we do not use this added degree of freedom. 

Finally, transition rates between pairs of states not 
mentioned are zero. 

With the Markov chain thus completely specified, 
computing the MTTF of a stripe, as the mean time to 
reach the ‘unavailable state’ r — 1 starting from state s, 
follows by standard methods [27]. 
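
One such standard method is to solve the linear system for expected absorption times. The sketch below is an illustration only: the function names, the dictionary layout of the failure rates, and the numeric rates are hypothetical, and the small Gauss-Jordan solver stands in for any linear algebra routine.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(n):
            if row != col:
                factor = m[row][col] / m[col][col]
                for c in range(col, n + 1):
                    m[row][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def stripe_mttf(s, r, failure_rates, recovery_rate):
    """Mean time to reach state r - 1 starting from state s.

    failure_rates[(i, j)] is the rate from i to j < i available chunks,
    with j = r - 1 standing for the unavailable state. Recovery moves
    from i to i + 1 at `recovery_rate` (serial recovery).
    """
    states = list(range(r, s + 1))
    index = {state: n for n, state in enumerate(states)}
    a = [[0.0] * len(states) for _ in states]
    b = [1.0] * len(states)
    for i in states:
        row = a[index[i]]
        for j in range(r - 1, i):
            rate = failure_rates.get((i, j), 0.0)
            row[index[i]] += rate          # total outflow from state i
            if j >= r:
                row[index[j]] -= rate      # flow into a non-absorbing state
        if i < s:
            row[index[i]] += recovery_rate
            row[index[i + 1]] -= recovery_rate
    return solve_linear(a, b)[index[s]]


# Hypothetical R = 2 stripe: independent chunk failures at rate 1, recovery at 10.
independent = {(2, 1): 2.0, (1, 0): 1.0}
mttf_independent = stripe_mttf(2, 1, independent, 10.0)
# Adding a burst transition that removes both chunks at once lowers the MTTF.
with_bursts = {(2, 1): 2.0, (2, 0): 0.1, (1, 0): 1.0}
mttf_with_bursts = stripe_mttf(2, 1, with_bursts, 10.0)
```

Each row of the system states that the outflow rate from a state times its expected absorption time, minus the rate-weighted absorption times of the states it can move to, equals one.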


7.2 Extension to multi-cell replication 


The models introduced so far can be extended to compute 
the availability of multi-cell replication schemes. An ex- 
ample of such a scheme is R = 3 x 2, where six replicas 
of the data are distributed as R = 3 replication in each of 
two linked cells. If data becomes unavailable at one cell 
then it is automatically recovered from another linked 
cell. These cells may be placed in separate datacenters, 
even on separate continents. Reed-Solomon codes may 
also be used, giving schemes such as RS(6,3) x 3 for 
three cells each with a RS(6,3) encoding of the data. 
We do not consider here the case when individual chunks 
may be combined from multiple cells to recover data, or 
other more complicated multi-cell encodings. 

We compute the availability of stripes that span cells 
by building on the Markov model just presented. Intu- 
itively, we treat each cell as a ‘chunk’ in the multi-cell 
‘stripe’, and compute its availability using the Markov 
model. We assume that failures at different data centers 
are independent, that is, that they lack a single point of 
failure such as a shared power plant or network link.
Additionally, when computing the cell availability, we
account for any cell-level or datacenter-level failures that
would affect availability. 

We build the corresponding transition matrix that
models the resulting multi-cell availability as follows.
We start from the transition matrices $M_i$ for each cell,
as explained in the previous section. We then build the
transition matrix for the combined scheme as the tensor
product of these, $\bigotimes_i M_i$, plus terms for whole cell
failures, and for cross-cell recoveries if the data becomes
unavailable in some cells but is still available in at least
one cell. However, it is a fair approximation to simply
treat each cell as a highly-reliable chunk in a multi-cell
stripe, as described above.

Besides symmetrical cases, such as R = 3 x 2 repli- 
cation, we can also model inhomogeneous replication 
schemes, such as one cell with R = 3 and one with 
R = 2. The state space of the Markov model is the 
product of the state space for each cell involved, but may 
be approximated again by simply counting how many of 
each type of cell is available. 

A point of interest here is the recovery bandwidth be- 
tween cells, quantified in Section 8.5. Bandwidth be- 
tween distant cells has significant cost which should 
be considered when choosing a multi-cell replication 
scheme. 


8 Markov Model Findings 


In this section, we apply the Markov models described 
above to understand how changes in the parameters of 
the system will affect end-system availability. 


8.1 Markov model validation 


We validate the Markov model by comparing MTTF pre- 
dicted by the model with actual MTTF values observed 
in production cells. We are interested in whether the 
Markov model provides an adequate tool for reasoning 
about stripe availability. Our main goal in using the 
model is providing a relative comparison of competing 
storage solutions, rather than a highly accurate predic- 
tion of any particular solution. 

We underline two observations that surface from val- 
idation. First, the model is able to capture well the ef- 
fect of failure bursts, which we consider as having the 
most impact on the availability numbers. For the cells we 
observed, the model predicted MTTF with the same or- 
der of magnitude as the measured MTTF. In one particu- 
lar cell, besides more regular unavailability events, there 
was a large failure burst where tens of nodes became un- 
available. This resulted in an MTTF of 1.76E+6 days, 
while the model predicted 5E+6 days. Though the relative
error exceeds 100%, we are satisfied with the model
accuracy, since it still gives us a powerful enough tool to
make decisions, as can be seen in the following sections.

Second, the model can distinguish between failure 
bursts that span racks, and thus pose a threat to availabil- 
ity, and those that do not. If one rack goes down, then 
without other events in the cell, the availability of stripes 
with R=3 replication will not be affected, since the stor- 
age system ensures that chunks in each stripe are placed 
on different racks. For one example cell, we noticed tens 
of medium sized failure bursts that affected one or two 
racks. We expected the availability of the cell to stay 
high, and indeed we measured MTTF = 29.52E+8 days. 
The model predicted 5.77E+8 days. Again, the relative 
error is significant, but for our purposes the model pro- 
vides sufficiently accurate predictions. 

Validating the model for all possible replication and 
Reed-Solomon encodings is infeasible, since our produc- 
tion cells are not set up to cover the complete space of 
options. However, because of our large number of pro- 
duction cells we are able to validate the model over a 
range of encodings and operating conditions. 


8.2 Importance of recovery rate 


To develop some intuition about the sensitivity of stripe
availability to recovery rate, consider the situation where
there are no failure bursts. Chunks fail independently
with rate $\lambda$ and recover with rate $\rho$. As in the previous
section, consider a stripe with $s$ chunks total which can
survive losing at most $s - r$ chunks, such as RS(r, s − r).
Thus the transition rate from state $i \ge r$ to state $i - 1$ is
$i\lambda$, and from state $i$ to $i + 1$ is $\rho$ for $r \le i < s$.

We compute the MTTF, given by the time taken to
reach state $r - 1$ starting in state $s$. Using standard methods
related to Gambler's Ruin [8, 14, 15, 26], this comes
to:

$$\sum_{k=0}^{s-r} \sum_{j=0}^{s-r-k} \frac{\rho^{j}}{\lambda^{j+1}\,(s-k)_{(j+1)}}$$

where $(a)_{(b)}$ denotes $a(a-1)(a-2)\cdots(a-b+1)$.
Assuming recoveries take much less time than node
MTTF (i.e. $\rho \gg \lambda$), gives a stripe MTTF of:

$$\frac{\rho^{s-r}}{\lambda^{s-r+1}} \cdot \frac{1}{(s)_{(s-r+1)}} + O\!\left(\frac{\rho^{s-r-1}}{\lambda^{s-r}}\right)$$

By similar computations, the recovery bandwidth consumed
is approximately $\lambda s$ per $r$ data chunks.

Thus, with no correlated failures, reducing recovery
times by a factor of $\mu$ will increase stripe MTTF by a
factor of $\mu^2$ for R = 3 and by $\mu^4$ for RS(9, 4).
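
The scaling claim can be checked numerically. The Python sketch below models the burst-free chain described in this section (failure rate i·lam out of state i, recovery rate rho for r ≤ i < s) and compares the exact expected time to reach state r − 1 with the leading-order term. The function names and the rate values used in any call are illustrative assumptions, not values from our measurements.

```python
from math import prod


def falling(a, b):
    """Falling factorial (a)_(b) = a (a-1) ... (a-b+1)."""
    return prod(a - j for j in range(b))


def stripe_mttf(s, r, lam, rho):
    """Exact expected time to go from state s to state r-1.

    Uses the first-passage recurrence T_s = 1/(s*lam) and
    T_i = (1 + rho * T_{i+1}) / (i * lam) for r <= i < s,
    where T_i is the expected time to step from state i down to i-1.
    """
    t = 1.0 / (s * lam)
    total = t
    for i in range(s - 1, r - 1, -1):
        t = (1.0 + rho * t) / (i * lam)
        total += t
    return total


def stripe_mttf_leading(s, r, lam, rho):
    """Leading-order MTTF, valid when rho is much larger than lam."""
    n = s - r
    return rho ** n / (lam ** (n + 1) * falling(s, n + 1))


if __name__ == "__main__":
    # Hypothetical rates (per day): R = 3 is s=3, r=1; RS(9, 4) is s=13, r=9.
    for label, s, r in (("R=3", 3, 1), ("RS(9,4)", 13, 9)):
        base = stripe_mttf(s, r, 1e-3, 1.0)
        faster = stripe_mttf(s, r, 1e-3, 10.0)
        print(label, "MTTF gain for 10x faster recovery:", faster / base)
```

With recovery much faster than failure, the gain printed for a 10x faster recovery should be close to 10^(s−r), which is the exponent behaviour stated above.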

Reducing recovery times is effective when correlated
failures are few. For RS(6, 3) with no correlated failures,
a 10% reduction in recovery time results in a 19% reduction
in unavailability. However, when correlated failures
are taken into account, even a 90% reduction in recovery
time results in only a 6% reduction in unavailability.


Policy          MTTF (days) with      MTTF (days) w/o
(% overhead)    correlated failures   correlated failures
R = 2 (100)     1.47E+5               4.99E+05
R = 3 (200)     6.82E+6               1.35E+09
R = 4 (300)     1.40E+8               2.75E+12
R = 5 (400)     2.41E+9               8.98E+15
RS(4,2) (50)    1.80E+6               1.35E+09
RS(6,3) (50)    1.03E+7               4.95E+12
RS(9,4) (44)    2.39E+6               9.01E+15
RS(8,4) (50)    5.11E+7               1.80E+16

Table 3: Stripe MTTF in days, corresponding to various data
redundancy policies and space overhead.


Policy                  MTTF        Bandwidth
(recovery time)         (days)      (per PB)
R = 2 × 2 (1 day)       1.08E+10    6.8 MB/day
R = 2 × 2 (1 hr)        2.58E+11    6.8 MB/day
RS(6,3) × 2 (1 day)     5.32E+13    97 KB/day
RS(6,3) × 2 (1 hr)      1.22E+15    97 KB/day

Table 4: Stripe MTTF and inter-cell bandwidth, for various
multi-cell schemes and inter-cell recovery times.


8.3 Impact of correlation on effectiveness of data- 
replication schemes 


Table 3 presents stripe availability for several data- 
replication schemes, measured in MTTF. We contrast 
this with stripe MTTF when node failures occur at the 
same total rate but are assumed independent. 

Note that failing to account for correlation of node fail- 
ures typically results in overestimating availability by at 
least two orders of magnitude, and eight in the case of 
RS(8,4). Correlation also reduces the benefit of increas- 
ing data redundancy. The gain in availability achieved 
by increasing the replication number, for example, grows 
much more slowly when we have correlated failures. 
Reed-Solomon encodings achieve resilience to failures
similar to that of replication, though with less storage
overhead.


8.4 Sensitivity of availability to component failure 
rates 


One common method for improving availability is reduc- 
ing component failure rates. By inserting altered failure 
rates of hardware into the model we can estimate the im- 
pact of potential improvements without actually building 
or deploying new hardware. 

We find that improvements below the node (server)
layer of the storage stack do not significantly improve
data availability. Assuming R = 3 is used, a 10% re- 
duction in the latent disk error rate has a negligible effect 
on stripe availability. Similarly, a 10% reduction in the 
disk failure rate increases stripe availability by less than 
1.5%. On the other hand, cutting node failure rates by 
10% can increase data availability by 18%. This holds 
generally for other encodings. 


8.5 Single vs multi-cell replication schemes 


Table 4 compares stripe MTTF under several multi-cell 
replication schemes and inter-cell recovery times, taking 
into consideration the effect of correlated failures within 
cells. 

Replicating data across multiple cells (data centers) 
greatly improves availability because it protects against 
correlated failures. For example, R = 2 × 2 with 1 day
recovery time between cells has two orders of magnitude 
longer MTTF than R = 4, shown in Table 3. 

This introduces a tradeoff between higher replication 
in a single cell and the cost of inter-cell bandwidth. The 
extra availability for R = 2 × 2 with 1 day recoveries versus
R = 4 comes at an average cost of 6.8 MB/(user PB)
copied between cells each day. This is the inverse MTTF 
for R = 2. 

It should be noted that most cross-cell recoveries will 
occur in the event of large failure bursts. This must be 
considered when calculating expected recovery times be- 
tween cells and the cost of on-demand access to poten- 
tially large amounts of bandwidth. 

Considering the relative cost of storage versus recov- 
ery bandwidth allows us to choose the most cost effective 
scheme given particular availability goals. 


9 Related Work 


Several previous studies [3, 19, 25, 29, 30] focus on the 
failure characteristics of independent hardware compo- 
nents, such as hard drives, storage subsystems, or mem- 
ory. As we have seen, these must be included when con- 
sidering availability but by themselves are insufficient. 
We focus on failure bursts, since they have a large in- 
fluence on the availability of the system. Previous litera- 
ture on failure bursts has focused on methods for discov- 
ering the relationship between the size of a failure event 
and its probability of occurrence. In [10], the existence 
of near-simultaneous failures in two large distributed sys- 
tems is reported. The beta-binomial density and the bi- 
exponential density are used to fit these distributions in 
[6] and [24], respectively. In [24], the authors further 
note that using an over-simplistic model for burst size, 
for example a single size, could result in “dramatic inaccuracies”
in practical settings. On the other hand, even
though the mean time to failure and mean time to recovery
of system nodes tend to be non-uniform and correlated,
this particular correlation effect has only a limited
impact on system-wide availability.

There is limited previous work on discovering patterns 
of correlation in failures. The conditional probability of 
failures for each pair of nodes in a system has been pro- 
posed in [6] as a measure of correlation in the system. 
This computation extends heuristically to larger sets of
nodes. A paradigm for discovering maximally independent
groups of nodes in a system to cope with correlated
failures is discussed in [34]. That paradigm involves col- 
lecting failure statistics on each node in the system and 
computing a measure of correlation, such as the mutual 
information, between every pair of nodes. Both of these 
approaches are computationally intensive and the results 
found, unlike ours, are not used to build a predictive an- 
alytical model for availability. 

Models that have been developed to study the relia- 
bility of long-term storage fall into two categories, non- 
Markov and Markov models. Those in the first category 
tend to be less versatile. For example, in [5] the prob- 
ability of multiple faults occurring during the recovery 
period of a stripe is approximated. Correlation is intro- 
duced by means of a multiplicative factor that is applied 
to the mean time to failure of a second chunk when the 
first chunk is already unavailable. This approach works 
only for stripes that are replicated and is not easily ex- 
tendable to Reed-Solomon encoding. Moreover, the fac- 
tor controlling time correlation is neither measurable nor 
derivable from other data. 

In [33], replication is compared with Reed-Solomon 
with respect to storage requirement, bandwidth for write 
and repair and disk seeks for reads. However, the com- 
parison assumes that sweep and repair are performed at 
regular intervals, as opposed to on demand. 

Markov models are able to capture the system much 
more generally and can be used to model both replication 
and Reed-Solomon encoding. Examples include [21], 
[32], [11] and [35]. However, these models all assume 
independent failures of chunks. As we have shown, this 
assumption potentially leads to overestimation of data 
availability by many orders of magnitude. The authors 
of [20] build a tool to optimize the disaster recovery ac- 
cording to availability requirements, with similar goals 
as our analysis of multi-cell replication. However, they 
do not focus on studying the effect of failure characteris- 
tics and data redundancy options. 

Node availability in our environment is different from 
previous work, such as [7, 18, 23], because we study a 
large system that is tightly coupled in a single administra- 
tive domain. These studies focus on measuring and pre- 
dicting availability of individual desktop machines from 
many, potentially untrusted, domains. Other authors 
[11] studied data replication in the face of failures, though
without considering availability of Reed-Solomon en- 
codings or multi-cell replication. 


10 Conclusions 


We have presented data from Google’s clusters that char- 
acterize the sources of failures contributing to unavail- 
ability. We find that correlation among node failures 
dwarfs all other contributions to unavailability in our pro- 
duction environment. 

In particular, though disk failures can result in permanent
data loss, the multitude of transitory node failures
accounts for most unavailability. We present a simple
time-window-based method to group failure events into 
failure bursts which, despite its simplicity, successfully 
identifies bursts with a common cause. We develop ana- 
lytical models to reason about past and future availability 
in our cells, including the effects of different choices of 
replication, data placement and system parameters. 

Inside Google, the analysis described in this paper has 
provided a picture of data availability at a finer granu- 
larity than previously measured. Using this framework, 
we provide feedback and recommendations to the de- 
velopment and operational engineering teams on differ- 
ent replication and encoding schemes, and the primary 
causes of data unavailability in our existing cells. Spe- 
cific examples include: 


• Determining the acceptable rate of successful transfers
to battery power for individual machines upon
a power outage.

• Focusing on reducing reboot times, because
planned kernel upgrades are a major source of correlated
failures.

• Moving towards a dynamic delay before initiating
recoveries, based on failure classification and recent
history of failures in the cell.


Such analysis complements the intuition of the design- 
ers and operators of these complex distributed systems. 
Abstract


Managing data and computation is at the heart of data- 
center computing. Manual management of data can lead 
to data loss, wasteful consumption of storage, and labo- 
rious bookkeeping. Lack of proper management of com- 
putation can result in lost opportunities to share common 
computations across multiple jobs or to compute results 
incrementally. 

Nectar is a system designed to address the aforemen- 
tioned problems. It automates and unifies the manage- 
ment of data and computation within a datacenter. In 
Nectar, data and computation are treated interchange- 
ably by associating data with its computation. De- 
rived datasets, which are the results of computations, are 
uniquely identified by the programs that produce them, 
and together with their programs, are automatically man- 
aged by a datacenter wide caching service. Any derived 
dataset can be transparently regenerated by re-executing 
its program, and any computation can be transparently 
avoided by using previously cached results. This en- 
ables us to greatly improve datacenter management and 
resource utilization: obsolete or infrequently used de- 
rived datasets are automatically garbage collected, and 
shared common computations are computed only once 
and reused by others. 

This paper describes the design and implementation of 
Nectar, and reports on our evaluation of the system using 
analytic studies of logs from several production clusters 
and an actual deployment on a 240-node cluster. 


1 Introduction 


Recent advances in distributed execution engines (MapReduce [7],
Dryad [18], and Hadoop [12]) and high-level
language support (Sawzall [25], Pig [24], BOOM [3],
HIVE [17], SCOPE [6], DryadLINQ [29]) have greatly
simplified the development of large-scale, data-intensive,
distributed applications. However, major challenges still
remain in realizing the full potential of data-intensive
distributed computing within datacenters. In current
practice, a large fraction of the computations in a datacenter
is redundant and many datasets are obsolete or
seldom used, wasting vast amounts of resources in a
datacenter.


*L. Ravindranath is affiliated with the Massachusetts Institute of
Technology and was a summer intern on the Nectar project.

As one example, we quantified the wasted storage in 
our 240-node experimental Dryad/DryadLINQ cluster. 
We crawled this cluster and noted the last access time 
for each data file. We discovered that around 50% of the
files had not been accessed in the last 250 days.

As another example, we examined the execution statis- 
tics of 25 production clusters running data-parallel ap- 
plications. We estimated that, on one such cluster, over 
7000 hours of redundant computation can be eliminated 
per day by caching intermediate results. (This is approx- 
imately equivalent to shutting off 300 machines daily.) 
Cumulatively, over all clusters, this figure is over 35,000 
hours per day. 

Many of the resource issues in a datacenter arise due 
to lack of efficient management of either data or compu- 
tation, or both. This paper describes Nectar: a system 
that manages the execution environment of a datacenter 
and is designed to address these problems. 

A key feature of Nectar is that it treats data and com- 
putation in a datacenter interchangeably in the following 
sense. Data that has not been accessed for a long pe- 
riod may be removed from the datacenter and substituted 
by the computation that produced it. Should the data be 
needed in the future, the computation is rerun. Similarly, 
instead of executing a user’s program, Nectar can par- 
tially or fully substitute the results of that computation 
with data already present in the datacenter. Nectar relies 
on certain properties of the programming environment 
in the datacenter to enable this interchange of data and 
computation. 

Computations running on a Nectar-managed datacenter
are specified as programs in LINQ [20]. LINQ comprises
a set of operators to manipulate datasets of .NET
objects. These operators are integrated into high level
.NET programming languages (e.g., C#), giving programmers
direct access to .NET libraries as well as traditional
language constructs such as loops, classes, and
modules. The datasets manipulated by LINQ can contain 
objects of an arbitrary .NET type, making it easy to com- 
pute with complex data such as vectors, matrices, and 
images. All of these operators are functional: they trans- 
form input datasets to new output datasets. This property 
helps Nectar reason about programs to detect program 
and data dependencies. LINQ is a very expressive and 
flexible language, e.g., the MapReduce class of compu- 
tations can be trivially expressed in LINQ. 

Data stored in a Nectar-managed datacenter are di- 
vided into one of two classes: primary or derived. Pri- 
mary datasets are created once and accessed many times. 
Derived datasets are the results produced by computa- 
tions running on primary and other derived datasets. Ex- 
amples of typical primary datasets in our datacenters 
are click and query logs. Examples of typical derived 
datasets are the results of thousands of computations per- 
formed on those click and query logs. 

In a Nectar-managed datacenter, all access to a derived 
dataset is mediated by Nectar. At the lowest level of the 
system, a derived dataset is referenced by the LINQ pro- 
gram fragment or expression that produced it. Program- 
mers refer to derived datasets with simple pathnames that 
contain a simple indirection (much like a UNIX symbolic 
link) to the actual LINQ programs that produce them. By 
maintaining this mapping between a derived dataset and 
the program that produced it, Nectar can reproduce any 
derived dataset after it is automatically deleted. Primary 
datasets are referenced by conventional pathnames, and 
are not automatically deleted. 

A Nectar-managed datacenter offers the following ad- 
vantages. 


1. Efficient space utilization. Nectar implements a 
cache server that manages the storage, retrieval, and 
eviction of the results of all computations (i.e., de- 
rived datasets). As well, Nectar retains the de- 
scription of the computation that produced a de- 
rived dataset. Since programmers do not directly 
manage datasets, Nectar has considerable latitude 
in optimizing space: it can remove unused or in- 
frequently used derived datasets and recreate them 
on demand by rerunning the computation. This is a 
classic trade-off of storage and computation. 


2. Reuse of shared sub-computations. Many appli- 
cations running in the same datacenter share com- 
mon sub-computations. Since Nectar automatically 
caches the results of sub-computations, they will be 
computed only once and reused by others. This significantly
reduces redundant computations, resulting
in better resource utilization.


3. Incremental computations. Many datacenter ap- 
plications repeat the same computation on a slid- 
ing window of an incrementally augmented dataset. 
Again, caching in Nectar enables us to reuse the re- 
sults of old data and only compute incrementally for 
the newly arriving data. 


4. Ease of content management. With derived datasets 
uniquely named by LINQ expressions, and auto- 
matically managed by Nectar, there is little need for 
developers to manage their data manually. In par- 
ticular, they do not have to be concerned about re- 
membering the location of the data. Executing the 
LINQ expression that produced the data is sufficient 
to access the data, and incurs negligible overhead in 
almost all cases because of caching. This is a sig- 
nificant advantage because most datacenter applica- 
tions consume a large amount of data from diverse 
locations and keeping track of the requisite filepath 
information is often a source of bugs. 


Our experiments show that Nectar, on average, could 
improve space utilization by at least 50%. As well, in- 
cremental and sub-computations managed by Nectar pro- 
vide an average speed up of 30% for the programs run- 
ning on our clusters. We provide a detailed quantitative 
evaluation of the first three benefits in Section 4. We 
have not done a detailed user study to quantify the fourth 
benefit, but the experience from our initial deployment 
suggests that there is evidence to support the claim. 

Some of the techniques we used such as dividing 
datasets into primary and derived and reusing the re- 
sults of previous computations via caching are reminis- 
cent of earlier work in version management systems [15], 
incremental database maintenance [5], and functional 
caching [16, 27]. Section 5 provides a more detailed 
analysis of our work in relation to prior research. 

This paper makes the following contributions to the 
literature: 


• We propose a novel and promising approach that
automates and unifies the management of data and
computation in a datacenter, leading to substantial
improvements in datacenter resource utilization.

• We present the design and implementation of our
system, including a sophisticated program rewriter
and static program dependency analyzer.

• We present a systematic analysis of the performance
of our system from a real deployment on 240 nodes
as well as analytical measurements.
Figure 1: Nectar architecture. The system consists of a
client-side library and cluster-wide services. Nectar re- 
lies on the services of DryadLINQ/Dryad and TidyFS, a 
distributed file system. 


The rest of this paper is organized as follows. Sec- 
tion 2 provides a high-level overview of the Nectar sys- 
tem. Section 3 describes the implementation of the sys- 
tem. Section 4 evaluates the system using real work- 
loads. Section 5 covers related work and Section 6 dis- 
cusses future work and concludes the paper. 


2 System Design Overview 


The overall Nectar architecture is shown in Figure 1. 
Nectar consists of a client-side component that runs on 
the programmer’s desktop, and two services running in 
the datacenter. 

Nectar is completely transparent to user programs and 
works as follows. It takes a DryadLINQ program as in- 
put, and consults the cache service to rewrite it to an 
equivalent, more efficient program. Nectar then hands 
the resulting program to DryadLINQ which further com- 
piles it into a Dryad computation running in the clus- 
ter. At run time, a Dryad job is a directed acyclic graph 
where vertices are programs and edges represent data 
channels. Vertices communicate with each other through 
data channels. The input and output of a DryadLINQ 
program are expected to be streams. A stream consists of 
an ordered sequence of extents, each storing a sequence 
of objects of some data type. We use an in-house fault-tolerant,
distributed file system called TidyFS to store
streams. 

Nectar makes certain assumptions about the underly- 
ing storage system. We require that streams be append- 
only, meaning that new contents are added by either ap- 
pending to the last extent or adding a new extent. The 
metadata of a stream contains Rabin fingerprints [4] of 
the entire stream and its extents. 

Nectar maintains and manages two namespaces in
TidyFS. The program store keeps all DryadLINQ programs
that have ever executed successfully. The data
store is used to store all derived streams generated by 
DryadLINQ programs. The Nectar cache server pro- 
vides cache hits to the program rewriter on the client 
side. It also implements a replacement policy that deletes 
cache entries of least value. Any stream in the data 
store that is not referenced by any cache entry is deemed 
to be garbage and deleted permanently by the Nectar 
garbage collector. Programs in the program store are 
never deleted and are used to recreate a deleted derived 
stream if it is needed in the future. 


A simple example of a program is shown in Ex- 
ample 2.1. The program groups identical words in a 
large document into groups and applies an arbitrary user- 
defined function Reduce to each group. This is a typ- 
ical MapReduce program. We will use it as a running 
example to describe the workings of Nectar. TidyFS, 
Dryad, and DryadLINQ are described in detail else- 
where [8, 18, 29]. We only discuss them briefly below 
to illustrate their relationships to our system. 

In the example, we assume that the input D is a large 
(replicated) dataset partitioned as D1, D2 ... Dn in the
TidyFS distributed file system and it consists of lines of 
text. SelectMany is a LINQ operator, which first pro- 
duces a single list of output records for each input record 
and then “flattens” the lists of output records into a sin- 
gle list. In our example, the program applies the function 
x => x.Split(' ') to each line in D to produce
the list of words in D. 

The program then uses the GroupBy operator to 
group the words into a list of groups, putting the same 
words into a single group. GroupBy takes a key-selector 
function as the argument, which when applied to an 
input record returns a collating “key” for that record. 
GroupBy applies the key-selector function to each input
record and collates the input into a list of groups (multi- 
sets), one group for all the records with the same key. 

The last line of the program applies a transforma- 
tion Reduce to each group. Select is a simpler ver- 
sion of SelectMany. Unlike the latter, Select pro- 
duces a single output record (determined by the function 
Reduce) for each input record. 


Example 2.1 A typical MapReduce job expressed in
LINQ. (x => x.Split(' ')) produces a list of
blank-separated words; (x => x) produces a key for
each input; Reduce is an arbitrary user-supplied function
that is applied to each input.


words = D.SelectMany(x => x.Split(' '));
groups = words.GroupBy(x => x);
result = groups.Select(x => Reduce(x));
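
For readers less familiar with LINQ, the three operators in Example 2.1 can be mimicked in a few lines of Python. This is only an illustrative sketch: the helper names (`select_many`, `group_by`, `select`, `reduce_group`) and the sample input are assumptions, the Reduce stand-in simply counts group members, and groups come out in sorted key order here, which is a simplification rather than LINQ's behaviour.

```python
from itertools import chain, groupby


def select_many(records, fn):
    """Apply fn to each record and flatten the resulting lists into one list."""
    return list(chain.from_iterable(fn(r) for r in records))


def group_by(records, key):
    """Collate records into (key, members) pairs, one pair per distinct key."""
    ordered = sorted(records, key=key)  # groupby needs equal keys to be adjacent
    return [(k, list(g)) for k, g in groupby(ordered, key=key)]


def select(records, fn):
    """Produce exactly one output record per input record."""
    return [fn(r) for r in records]


def reduce_group(group):
    """Hypothetical stand-in for the user-defined Reduce: count the words."""
    word, members = group
    return (word, len(members))


# Hypothetical two-line input standing in for the dataset D.
D = ["a rose is a rose", "is a rose"]
words = select_many(D, lambda x: x.split(" "))
groups = group_by(words, lambda x: x)
result = select(groups, reduce_group)
```

The three assignments at the end mirror the three lines of Example 2.1 one for one.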
Figure 2: Execution graph produced by Nectar given
the input LINQ program in Example 2.1. The nodes
named SM+D execute SelectMany and distribute the
results. GB+S executes GroupBy and Select.


When the program in Example 2.1 is run for the first 
time, Nectar, by invoking DryadLINQ, produces the dis- 
tributed execution graph shown in Figure 2, which is then 
handed to Dryad for execution. (For simplicity of exposi- 
tion, we assume for now that there are no cache hits when 
Nectar rewrites the program.) The SM+D vertex performs 
the SelectMany and distributes the results by parti- 
tioning them on a hash of each word. This ensures that 
identical words are destined to the same GB+S vertex 
in the graph. The GB+S vertex performs the GroupBy 
and Select operations together. The AE vertex adds a 
cache entry for the final result of the program. Notice 
that the derived stream created for the cache entry shares 
the same set of extents with the result of the computa- 
tion. So, there is no additional cost of storage space. As 
a rule, Nectar always creates a cache entry for the final 
result of a computation. 
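
The reason hash partitioning sends identical words to the same GB+S vertex can be seen in a small Python sketch. It is not Dryad's implementation: `partition_of` and `distribute` are hypothetical names, and CRC32 is used only as a convenient deterministic hash.

```python
import zlib
from collections import defaultdict


def partition_of(word, num_partitions):
    """Map a word to a partition index using a deterministic hash.

    zlib.crc32 is a stand-in; Python's built-in hash() for strings is
    randomized per process, so it is avoided here.
    """
    return zlib.crc32(word.encode("utf-8")) % num_partitions


def distribute(words, num_partitions):
    """Group words into buckets, one bucket per downstream vertex."""
    buckets = defaultdict(list)
    for w in words:
        buckets[partition_of(w, num_partitions)].append(w)
    return dict(buckets)
```

Because the partition index depends only on the word itself, every occurrence of a word lands in the same bucket, so each downstream vertex can group its input without consulting the others.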





2.1 Client-Side Library 


On the client side, Nectar takes advantage of cached re- 
sults from the cache to rewrite a program P to an equiv- 
alent, more efficient program P’. It automatically inserts 
AddEntry calls at appropriate places in the program so 
new cache entries can be created when P’ is executed. 
The AddEntry calls are compiled into Dryad vertices that 
create new cache entries at runtime. We summarize the 
two main client-side components below. 


Cache Key Calculation 
A computation is uniquely identified by its program 
and inputs. We therefore use the Rabin fingerprint of 
the program and the input datasets as the cache key for
a computation. The input datasets are stored in TidyFS 
and their fingerprints are calculated based on the actual 
stream contents. Nectar calculates the fingerprint of the 
program and combines it with the fingerprints of the in- 
put datasets to form the cache key. 

The fingerprint of a DryadLINQ program must be able 
to detect any changes to the code the program depends 
on. However, the fingerprint should not change when 
code the program does not depend on changes. This 
is crucial for the correctness and practicality of Nectar. 
(Fingerprints can collide but the probability of a colli- 
sion can be made vanishingly small by choosing long 
enough fingerprints.) We implement a static dependency 
analyzer to compute the transitive closure of all the code 
that can be reached from the program. The fingerprint is 
then formed using all reachable code. Of course, our an- 
alyzer only produces an over-approximation of the true 
dependency. 
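
The key construction can be illustrated with a short Python sketch. It makes several assumptions that are not part of Nectar: SHA-256 stands in for the Rabin fingerprint, the program is given as a hypothetical call graph (`calls`) with per-function source text (`source`), and input fingerprints are passed in as strings.

```python
import hashlib


def reachable(entry, calls):
    """Transitive closure of the functions reachable from `entry`."""
    seen, stack = set(), [entry]
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(calls.get(name, ()))
    return seen


def program_fingerprint(entry, calls, source):
    """Fingerprint only the code the program can reach (SHA-256 stand-in)."""
    h = hashlib.sha256()
    for name in sorted(reachable(entry, calls)):  # sorted for determinism
        h.update(name.encode("utf-8") + b"\0")
        h.update(source[name].encode("utf-8") + b"\0")
    return h.hexdigest()


def cache_key(entry, calls, source, input_fingerprints):
    """Combine the program fingerprint with the input fingerprints."""
    h = hashlib.sha256(program_fingerprint(entry, calls, source).encode("ascii"))
    for fp in input_fingerprints:
        h.update(fp.encode("utf-8") + b"\0")
    return h.hexdigest()
```

In this sketch, editing a function that the entry point never reaches leaves the key unchanged, while editing reachable code or changing any input fingerprint produces a different key.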


Rewriter 

Nectar rewrites user programs to use cached results 
where possible. We might encounter different entries 
in the cache server with different sub-expressions and/or 
partial input datasets. So there are typically multiple al- 
ternatives to choose from in rewriting a DryadLINQ pro- 
gram. The rewriter uses a cost estimator to choose the 
best one from multiple alternatives (as discussed in Sec- 
tion 3.1). 

Nectar supports the following two rewriting scenarios 
that arise very commonly in practice. 


Common sub-expressions. Internally, a DryadLINQ 
program is represented as a LINQ expression tree. Nec- 
tar treats all prefix sub-expressions of the expression tree 
as candidates for caching and looks up in the cache for 
possible cache hits for every prefix sub-expression. 
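
As a rough illustration of prefix lookup, the Python sketch below treats a program as a linear chain of operator names applied to one input and searches the cache from the longest prefix down. This is a simplification under stated assumptions: real expression trees are not linear, the rewriter chooses among alternatives with a cost estimator rather than always taking the longest hit, and the key layout and function names are hypothetical.

```python
def prefixes(ops):
    """All non-empty prefixes of the operator chain, longest first."""
    return [tuple(ops[:n]) for n in range(len(ops), 0, -1)]


def rewrite(input_fp, ops, cache):
    """Return (cached_result, remaining_ops) for the longest cached prefix.

    `cache` maps (input fingerprint, operator prefix) to a cached result.
    If nothing is cached, the result is None and all operators remain.
    """
    for prefix in prefixes(ops):
        key = (input_fp, prefix)
        if key in cache:
            return cache[key], list(ops[len(prefix):])
    return None, list(ops)
```

For the program in Example 2.1, a cached entry for the SelectMany and GroupBy prefix would leave only the Select step to execute.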


Incremental computations. Incremental computation 
on datasets is a common occurrence in data intensive 
computing. Typically, a user has run a program P on input
D. Now, the user is about to compute P on input D + D’,
the concatenation of D and D’. The Nectar rewriter finds 
a new operator to combine the results of computing on 
the old input and the new input separately. See Sec- 
tion 2.3 for an example. 

A special case of incremental computation that occurs 
in datacenters is a computation that executes on a sliding 
window of data. That is, the same program is repeatedly 
run on the following sequence of inputs: 


$$Input_1 = d_1 + d_2 + \cdots + d_n,$$
$$Input_2 = d_2 + d_3 + \cdots + d_{n+1},$$
$$Input_3 = d_3 + d_4 + \cdots + d_{n+2},$$

Here $d_i$ is a dataset that (potentially) consists of multiple
extents distributed over many computers. So successive
inputs to the program ($Input_i$) are datasets with
some old extents removed from the head of the previous
input and new extents appended to the tail of it. Nectar
generates cache entries for each individual dataset $d_i$,
and can use them in subsequent computations.
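
A minimal Python sketch of the sliding-window case, using word counting as the computation, shows how per-dataset cache entries let each new window process only the newly arrived dataset. The names are hypothetical, and the cache is keyed by dataset name for brevity, whereas Nectar keys entries by fingerprints.

```python
from collections import Counter

cache = {}     # per-dataset partial results
computed = []  # records which datasets were actually processed


def count_words(name, lines):
    """Return the word counts for one dataset, computing them at most once."""
    if name not in cache:
        computed.append(name)
        cache[name] = Counter(w for line in lines for w in line.split())
    return cache[name]


def run_window(window, datasets):
    """Combine the per-dataset counts for every dataset in the window."""
    total = Counter()
    for name in window:
        total += count_words(name, datasets[name])
    return total
```

Running the window d1..d3 and then d2..d4 processes d4 only in the second run; the counts for d2 and d3 come from the cache and are merged by addition, which plays the role of the combining operator.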

In the real world, a program may belong to a combina- 
tion of the categories above. For example, an application 
that analyzes logs of the past seven days is rewritten as 
an incremental computation by Nectar, but Nectar may 
use sub-expression results of log preprocessing on each 
day from other applications. 


2.2 Datacenter-Wide Service 


The datacenter-wide service in Nectar comprises two 
separate components: the cache service and the garbage 
collection service. The actual datasets are stored in 
the distributed storage system and the datacenter-wide 
services manipulate the actual datasets by maintaining 
pointers to them. 


Cache Service 

Nectar implements a distributed datacenter-wide 
cache service for bookkeeping information about Dryad- 
LINQ programs and the location of their results. The 
cache service has two main functionalities: (1) serving 
the cache lookup requests by the Nectar rewriter; and (2) 
managing derived datasets by deleting the cache entries 
of least value. 

Programs of all successful computations are uploaded 
to a dedicated program store in the cluster. Thus, the 
service has the necessary information about cached re- 
sults, meaning that it has a recipe to recreate any de- 
rived dataset in the datacenter. When a derived dataset 
is deleted but needed in the future, Nectar recreates it us- 
ing the program that produced it. If the inputs to that 
program have themselves been deleted, it backtracks re- 
cursively till it hits the immutable primary datasets or 
cached derived datasets. Because of this ability to recre- 
ate datasets, the cache server can make informed deci- 
sions to implement a cache replacement policy, keeping 
the cached results that yield the most hits and deleting the 
cached results of less value when storage space is low. 


Garbage Collector 

The Nectar garbage collector operates transparently to 
the users of the cluster. Its main job is to identify datasets 
unreachable from any cache entry and delete them. We 
use a standard mark-and-sweep collector. Actual content 
deletion is done in the background without interfering 
with the concurrent activities of the cache server and job 
executions. Section 3.2 has additional detail. 











Figure 3: Execution graph produced by Nectar on the 
program in Example 2.1 after it elects to cache the results 
of computations. Notice that the GroupBy and Select 
are now encapsulated in separate nodes. The new AE 
vertex creates a cache entry for the output of GroupBy. 


2.3 Example: Program Rewriting 


Let us look at the interesting case of incremental compu- 
tation by continuing Example 2.1. 

After the program has been executed a sufficient num- 
ber of times, Nectar may elect to cache results from some 
of its subcomputations based on the usage information 
returned to it from the cache service. So subsequent runs 
of the program may cause Nectar to create different exe- 
cution graphs than those created previously for the same 
program. Figure 3 shows the new execution graph when 
Nectar chooses to cache the result of GroupBy (cf. Figure 2).
It breaks the pipeline of GroupBy and Select and creates an
additional AddEntry vertex (denoted by AE) to cache the result
of GroupBy. During the execution, when the GB stage completes,
the AE vertex will run, creating a new TidyFS stream and a
cache entry for the result of GroupBy. We denote the stream by
$G_D$, partitioned as $G_{D1}, G_{D2}, \ldots, G_{Dn}$.

Subsequently, assume the program in Example 2.1 is run on
input (D + X), where X is a new dataset partitioned as
$X_1, X_2, \ldots, X_k$. The Nectar rewriter would get a cache
hit on $G_D$. So it only needs to perform GroupBy on X and
merge with $G_D$ to form new groups. Figure 4 shows the new
execution graph created by Nectar.

There are some subtleties involved in the rewriting process.
Nectar first determines the number of partitions (n) of $G_D$.
It then computes GroupBy on X the same way as $G_D$,
generating n partitions with the same distribution scheme
using the identical hash function as was used previously (see
Figures 2 and 3). That is, the rewritten execution graph has k
SM+D vertices, but n GB vertices. The MG vertex then performs
a pairwise merge of the output of GB with the cached result
$G_D$. The result of MG is again cached for future uses,
because Nectar notices the pattern of incremental computation
and expects that the same computation will happen on datasets
of form $G_{D+X+Y}$ in the future.
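
The reason the new data must be partitioned exactly like the cached result can be illustrated with a small sketch. It assumes records are (key, value) pairs and uses CRC32 as a stand-in deterministic hash; the function names are hypothetical and do not correspond to Dryad or DryadLINQ APIs.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, int]
Partition = Dict[str, List[int]]


def partition_of(key: str, n: int) -> int:
    # Deterministic hash, so cached and new data agree on key placement.
    return zlib.crc32(key.encode("utf-8")) % n


def group_by(records: List[Record], n: int) -> List[Partition]:
    """Hash-partition records into n partitions and group values by key."""
    parts: List[Partition] = [defaultdict(list) for _ in range(n)]
    for key, value in records:
        parts[partition_of(key, n)][key].append(value)
    return parts


def merge_groups(cached: List[Partition], fresh: List[Partition]) -> List[Partition]:
    """Pairwise merge: partition i of the cache only meets partition i of the new data."""
    assert len(cached) == len(fresh), "both sides must use the same n"
    merged: List[Partition] = []
    for old, new in zip(cached, fresh):
        out: Partition = defaultdict(list)
        for part in (old, new):
            for key, values in part.items():
                out[key].extend(values)
        merged.append(out)
    return merged


def flatten(parts: List[Partition]) -> Dict[str, List[int]]:
    return {k: sorted(v) for p in parts for k, v in p.items()}


n = 4
D = [("a", 1), ("b", 2), ("a", 3)]
X = [("b", 4), ("c", 5)]
G_D = group_by(D, n)                          # cached result
G_DX = merge_groups(G_D, group_by(X, n))      # incremental result for D + X
```

Because both sides use the same hash function and the same n, every key lands in the same partition index on both sides, so a pairwise merge suffices and the result matches a GroupBy over D + X computed from scratch.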





Figure 4: The execution graph produced by Nectar on
the program in Example 2.1 on the dataset D + X. The
dataset X consists of k partitions. The MG vertex merges
groups with the same key. Both the results of GB and MG
are cached. There are k SM+D vertices, but n GB, MG,
and S vertices. $G_{D1}, \ldots, G_{Dn}$ are the partitions of the
cached result.


Similar to MapReduce’s combiner optimization [7] 
and Data Cube computation [10], DryadLINQ can de- 
compose Reduce into the composition of two associa- 
tive and commutative functions if Reduce is determined 
to be decomposable. We handle this by first applying the 
decomposition as in [28] and then the caching and rewrit- 
ing as described above. 


3 Implementation Details 


We now present the implementation details of the two 
most important aspects of Nectar: Section 3.1 describes 
computation caching and Section 3.2 describes the auto- 
matic management of derived datasets. 


3.1 Caching Computations 


Nectar rewrites a DryadLINQ program to an equivalent 
but more efficient one using cached results. This gen- 
erally involves: 1) identifying all sub-expressions of the 
expression, 2) probing the cache server for all cache hits 
for the sub-expressions, 3) using the cache hits to rewrite 
the expression into a set of equivalent expressions, and 4) 
choosing one that gives us the maximum benefit based on
some cost estimation. 


Cache and Programs 

A cache entry records the result of executing a pro- 
gram on some given input. (Recall that a program may 
have more than one input depending on its arity.) The 
entry is of the form: 


$\langle FP_{PD}, FP_P, Result, Statistics, FPList \rangle$


Here, $FP_{PD}$ is the combined fingerprint of the program and
its input datasets, $FP_P$ is the fingerprint of the program
only, Result is the location of the output, and Statistics
contains execution and usage information of this cache entry.
The last field FPList contains a list of fingerprint pairs each
representing the fingerprints of the first and last extents of
an input dataset. We have one fingerprint pair for every input
of the program. As we shall see later, it is used by the
rewriter to search amongst cache hits efficiently. Since the
same program could have been executed on different occasions
on different inputs, there can be multiple cache entries with
the same $FP_P$.

We use $FP_{PD}$ as the primary key. So our caching is sound
only if $FP_{PD}$ can uniquely determine the result of the
computation. The fingerprint of the inputs is
based on the actual content of the datasets. The finger- 
print of a dataset is formed by combining the fingerprints 
of its extents. For a large dataset, the fingerprints of its 
extents are efficiently computed in parallel by the data- 
center computers. 
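
A minimal sketch of content-based dataset fingerprinting follows. The text does not specify the fingerprint function or how extent fingerprints are combined, so SHA-256 and an ordered hash-of-hashes are assumptions made purely for illustration, as are the function names; a thread pool stands in for the parallel computation across datacenter machines.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def extent_fingerprint(extent: bytes) -> bytes:
    # Assumed fingerprint function: SHA-256 over the extent's content.
    return hashlib.sha256(extent).digest()


def dataset_fingerprint(extents: Sequence[bytes]) -> str:
    """Fingerprint the extents in parallel, then combine them in extent order."""
    with ThreadPoolExecutor() as pool:
        extent_fps = list(pool.map(extent_fingerprint, extents))  # order preserved
    combined = hashlib.sha256()
    for fp in extent_fps:
        combined.update(fp)
    return combined.hexdigest()


def combined_fingerprint(program_fp: str, input_fps: List[str]) -> str:
    """Hypothetical analogue of FP_PD: program fingerprint plus input fingerprints."""
    h = hashlib.sha256(program_fp.encode("utf-8"))
    for fp in input_fps:
        h.update(fp.encode("utf-8"))
    return h.hexdigest()


fp_a = dataset_fingerprint([b"extent-0", b"extent-1"])
fp_b = dataset_fingerprint([b"extent-0", b"extent-1"])
fp_c = dataset_fingerprint([b"extent-1", b"extent-0"])
```

Identical content yields identical fingerprints regardless of where the data is stored, while reordering or changing extents yields a different fingerprint.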

The computation of the program fingerprint is tricky, 
as the program may contain user-defined functions that 
call into library code. We implemented a static dependency
analyzer to capture all dependencies of an expression. At the
time a DryadLINQ program is invoked, DryadLINQ knows all the
dynamically linked libraries (DLLs) it depends on. We divide
them into two
categories: system and application. We assume system 
DLLs are available and identical on all cluster machines 
and therefore are not included in the dependency. For 
an application DLL that is written in native code (e.g., 
C or assembler), we include the entire DLL as a depen- 
dency. For soundness, we assume that there are no call- 
backs from native to managed code. For an application 
DLL that is in managed code (e.g., C#), our analyzer tra- 
verses the call graph to compute all the code reachable 
from the initial expression. 

The analyzer works at the bytecode level. It uses stan- 
dard .NET reflection to get the body of a method, finds 
all the possible methods that can be called in the body, 
and traverses those methods recursively. When a virtual 
method call is encountered, we include all the possible 
call sites. While our analysis is certainly a conservative 
approximation of the true dependency, it is reasonably 
precise and works well in practice. Since dynamic code
generation could introduce unsoundness into the analy- 
sis, it is forbidden in managed application DLLs, and is 
statically enforced by the analyzer. 
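
The traversal described above is, in essence, a reachability computation over a call graph. The sketch below assumes the graph has already been extracted into a dictionary (the real analyzer reads method bodies through .NET reflection, which is not modeled here); a virtual call is represented conservatively by edges to every possible override. All method names are made up.

```python
from typing import Dict, Iterable, List, Set


def reachable_methods(call_graph: Dict[str, List[str]], roots: Iterable[str]) -> Set[str]:
    """Return every method reachable from the root methods (iterative DFS)."""
    seen: Set[str] = set()
    stack = list(roots)
    while stack:
        method = stack.pop()
        if method in seen:
            continue                      # already visited; also handles recursion
        seen.add(method)
        stack.extend(call_graph.get(method, []))
    return seen


# Hypothetical call graph. "Shape.Area" is a virtual call, so it points at
# all overrides that could be invoked at run time.
call_graph = {
    "Query.Select": ["Parser.Parse", "Shape.Area"],
    "Shape.Area": ["Circle.Area", "Square.Area"],
    "Circle.Area": ["Math.Pi"],
    "Unused.Helper": ["Math.Pi"],
}
deps = reachable_methods(call_graph, ["Query.Select"])
```

Only code reachable from the expression contributes to the dependency set, so a change to `Unused.Helper` in this example would not alter the program fingerprint.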

The statistics information kept in the cache entry is 
used by the rewriter to find an optimal execution plan. It 
is also used to implement the cache insertion and eviction 
policy. It contains information such as the cumulative ex- 
ecution time, the number of hits on this entry, and the last 
access time. The cumulative execution time is defined as 
the sum of the execution time of all upstream Dryad ver- 
tices of the current execution stage. It is computed at the 
time of the cache entry insertion using the execution logs 
generated by Dryad. 

The cache server supports a simple client interface. The
important operations include: (1) Lookup(fp) finds and returns
the cache entry that has fp as the primary key ($FP_{PD}$);
(2) Inquire(fp) returns all cache entries that have fp as
their $FP_P$; and (3) AddEntry inserts a new cache entry. We
will see their uses in the following sections.
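
The three operations can be pictured with a small in-memory model. This is a sketch under the assumption of a single-process dictionary-backed store; the class and field names are illustrative and are not the cache server's actual API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class CacheEntry:
    fp_pd: str                                   # primary key: program + inputs
    fp_p: str                                    # program-only fingerprint
    result: str                                  # location of the output
    statistics: dict = field(default_factory=dict)
    fp_list: List[Tuple[str, str]] = field(default_factory=list)  # (first, last) extent fps


class CacheServer:
    def __init__(self) -> None:
        self._by_fp_pd: Dict[str, CacheEntry] = {}
        self._by_fp_p: Dict[str, List[CacheEntry]] = {}

    def add_entry(self, entry: CacheEntry) -> None:
        """AddEntry: insert a new cache entry under both indexes."""
        self._by_fp_pd[entry.fp_pd] = entry
        self._by_fp_p.setdefault(entry.fp_p, []).append(entry)

    def lookup(self, fp: str) -> Optional[CacheEntry]:
        """Lookup: the entry whose primary key is fp, or None."""
        return self._by_fp_pd.get(fp)

    def inquire(self, fp: str) -> List[CacheEntry]:
        """Inquire: all entries for the program fingerprint fp, on any input."""
        return list(self._by_fp_p.get(fp, []))


server = CacheServer()
server.add_entry(CacheEntry("pd-1", "p-1", "store/out1", fp_list=[("e0", "e2")]))
server.add_entry(CacheEntry("pd-2", "p-1", "store/out2", fp_list=[("e3", "e4")]))
```

Keeping a second index on the program fingerprint is what makes Inquire cheap: the same program run on different inputs produces several entries that share one `fp_p`.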





The Rewriting Algorithm 

Having explained the structure and interface of the 
cache, let us now look at how Nectar rewrites a program. 

For a given expression, we may get cache hits on 
any possible sub-expression and subset of the input 
dataset, and considering all of them in the rewriting 
is not tractable. We therefore only consider cache 
hits on prefix sub-expressions on segments of the input 
dataset. More concretely, consider a simple example
D.Where(P).Select(F). The Where operator applies a filter to
the input dataset D, and the Select operator applies a
transformation to each item in its input. We will only
consider cache hits for the sub-expressions S.Where(P) and
S.Where(P).Select(F) where S is a subsequence of extents in D.

Our rewriting algorithm is a simple recursive proce- 
dure. We start from the largest prefix sub-expression, the 
entire expression. Below is an outline of the algorithm. 
For simplicity of exposition, we assume that the expres- 
sions have only one input. 


Step 1. For the current sub-expression E, we probe the
cache server to obtain all the possible hits on it. There
can be multiple hits on different subsequences of the input
D. Let us denote the set of hits by H. Note that each hit
also gives us its saving in terms of cumulative execution
time. If there is a hit on the entire input D, we use that
hit and terminate because it gives us the most savings in
terms of cumulative execution time. Otherwise we execute
Steps 2-4.


Step 2. We compute the best execution plan for E using
hits on its smaller prefixes. To do that, we first compute
the best execution plan for each immediate successor prefix
of E by calling our procedure recursively, and then combine
them to form a single plan for E. Let us denote this plan by
$(P_1, C_1)$ where $C_1$ is its saving in terms of cumulative
execution time.


Step 3. For the hits H on E (from Step 1), we choose
a subset of them such that (a) they operate on disjoint
subsequences of D, and (b) they give us the most saving
in terms of cumulative execution time. This boils down
to the well-known problem of computing the maximum
independent sets of an interval graph, which has a known
efficient solution using dynamic programming techniques [9].
We use this subset to form another execution plan for E on D.
Let us denote this plan by $(P_2, C_2)$.


Step 4. The final execution plan is the one from $P_1$ and
$P_2$ that gives us more saving.
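
The subset selection in Step 3 can be sketched with the textbook dynamic program for weighted interval scheduling. The representation is an assumption made for illustration: each hit is a tuple (start, end, saving) covering the half-open range of extent indices [start, end) of D, and `best_disjoint_hits` is a hypothetical name. The text cites [9] for the technique and does not give the exact algorithm used.

```python
from bisect import bisect_right
from typing import List, Tuple

Hit = Tuple[int, int, float]   # (start, end, saving) over extents [start, end)


def best_disjoint_hits(hits: List[Hit]) -> Tuple[float, List[Hit]]:
    """Choose pairwise-disjoint hits with the largest total saving."""
    ordered = sorted(hits, key=lambda h: h[1])          # sort by end position
    ends = [h[1] for h in ordered]
    # best[i] = (total saving, chosen hits) using only the first i hits
    best: List[Tuple[float, List[Hit]]] = [(0.0, [])]
    for i, hit in enumerate(ordered):
        start, _end, saving = hit
        # j = number of earlier hits that end at or before this hit starts
        j = bisect_right(ends, start, 0, i)
        take = (best[j][0] + saving, best[j][1] + [hit])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]


hits = [(0, 3, 5.0), (2, 5, 6.0), (3, 6, 5.0), (5, 8, 3.0)]
total, chosen = best_disjoint_hits(hits)
```

In this example the single most valuable hit, (2, 5, 6.0), is not selected, because the two hits that together cover extents 0 through 5 save more in total. Step 4 would then compare this total with the saving $C_1$ of the plan built from smaller prefixes.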


In Step 1, the rewriter calls Inquire to compute H. 
As described before, Inquire returns all the possible 
cache hits of the program with different inputs. A useful 
hit means that its input dataset is identical to a
subsequence of extents of D. A brute force search is
inefficient because it requires checking every subsequence.
As an optimization, we store in the cache entry the
fingerprints of the first and last extents of the input
dataset. With that information, we can compute H in linear
time.
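
One way the first and last extent fingerprints could yield a linear-time filter is sketched below. It assumes that extent fingerprints within D are distinct and that each candidate is represented as (first_fp, last_fp, saving); both the representation and the function name are hypothetical. Matching endpoints only identifies a candidate range, so a real system would presumably still confirm that the full input fingerprint matches.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[str, str, float]     # (first extent fp, last extent fp, saving)
Hit = Tuple[int, int, float]           # (start, end, saving) over extents [start, end)


def candidate_hits(extent_fps: List[str], entries: List[Candidate]) -> List[Hit]:
    """Map cache entries to extent ranges of D in O(|D| + |entries|) time."""
    position: Dict[str, int] = {fp: i for i, fp in enumerate(extent_fps)}
    hits: List[Hit] = []
    for first_fp, last_fp, saving in entries:
        i = position.get(first_fp)
        j = position.get(last_fp)
        # Keep the entry only if both endpoints occur in D in the right order.
        if i is not None and j is not None and i <= j:
            hits.append((i, j + 1, saving))
    return hits


extent_fps = ["e0", "e1", "e2", "e3", "e4"]
entries = [
    ("e0", "e2", 5.0),    # covers extents 0..2
    ("e3", "e4", 2.0),    # covers extents 3..4
    ("e2", "e1", 9.0),    # endpoints out of order: not a subsequence of D
    ("x9", "e4", 1.0),    # first extent not in D at all
]
H = candidate_hits(extent_fps, entries)
```

One pass builds the position index over D and one pass scans the entries returned by Inquire, which avoids enumerating every subsequence of D.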

Intuitively, in rewriting a program P on incremental
data Nectar tries to derive a combining operator C such
that P(D + D') = C(P(D), D'), where C combines the
results of P on the datasets D and D'. Nectar supports
all the LINQ operators DryadLINQ supports.

The combining functions for some LINQ opera- 
tors require the parallel merging of multiple streams, 
and are not directly supported by DryadLINQ. We 
introduced three combining functions: MergeSort, 
HashMergeGroups, and SortMergeGroups, 
which are straightforward to implement using Dryad- 
LINQ’s Apply operator [29]. MergeSort takes 
multiple sorted input streams, and merge sorts them. 
HashMergeGroups and SortMergeGroups take 
multiple input streams and merge groups of the same 
key from the input streams. If all the input streams are 
sorted, Nectar chooses to use SortMergeGroups, 
which is streaming and more efficient. Otherwise, 
Nectar uses HashMergeGroups. The MG vertex in 
Figure 4 is an example of this group merge. 
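
The behavior of the three combining functions can be illustrated in a few lines. These are single-machine analogues written with the Python standard library, not the DryadLINQ Apply-based implementations; group streams are assumed to be sequences of (key, values) pairs.

```python
import heapq
from itertools import groupby
from operator import itemgetter
from typing import Dict, Iterable, Iterator, List, Tuple

Group = Tuple[str, List[int]]


def merge_sort(*streams: Iterable[int]) -> Iterator[int]:
    """Merge already-sorted streams into one sorted stream."""
    return heapq.merge(*streams)


def sort_merge_groups(*streams: Iterable[Group]) -> Iterator[Group]:
    """Streaming merge of groups; requires every stream to be sorted by key."""
    merged = heapq.merge(*streams, key=itemgetter(0))
    for key, run in groupby(merged, key=itemgetter(0)):
        values: List[int] = []
        for _, vs in run:
            values.extend(vs)
        yield key, values


def hash_merge_groups(*streams: Iterable[Group]) -> Dict[str, List[int]]:
    """Merge groups via a hash table; works on unsorted streams but buffers all keys."""
    table: Dict[str, List[int]] = {}
    for stream in streams:
        for key, vs in stream:
            table.setdefault(key, []).extend(vs)
    return table


cached = [("a", [1]), ("b", [2])]
fresh = [("b", [3]), ("c", [4])]
sorted_result = list(sort_merge_groups(cached, fresh))
hashed_result = hash_merge_groups(cached, fresh)
```

The sorted variant emits each group as soon as its key has been passed in every input, so it needs to hold only one group at a time, which is why it is preferred when all inputs are sorted.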

The technique of reusing materialized views in 
database systems addresses a similar problem. One im- 
portant difference is that a database typically does not 
maintain views for multiple versions of a table, which 
would prevent it from reusing results computed on old 
incarnations of the table. For example, suppose we have 
a materialized view V on D. When D is changed to
$D + D_1$, the view is also updated to V'. So for any future
computation on $D + D_2$, V is no longer available for use.
In contrast, Nectar maintains both V and V', and
automatically tries to reuse them for any computation, in
particular the ones on $D + D_2$.


Cache Insertion Policy 


We consider every prefix sub-expression of an expres- 
sion to be a candidate for caching. Adding a cache entry 
incurs additional cost if the entry is not useful. It requires 
us to store the result of the computation on disk (instead 
of possibly pipelining the result to the next stage), incur- 
ring the additional disk IO and space overhead. Obvi- 
ously it is not practical to cache everything. Nectar im- 
plements a simple strategy to determine what to cache. 


First of all, Nectar always creates a cache entry for 
the final result of a computation as we get it for free: it 
does not involve a break of the computation pipeline and 
incurs no extra IO and space overhead. 


For sub-expression candidates, we wish to cache them 
only when they are predicted to be useful in the future. 
However, determining the potential usefulness of a cache 
entry is generally difficult. So we base our cache inser- 
tion policy on heuristics. The caching decision is made 
in the following two phases. 


First, when the rewriter rewrites an expression, it de- 
cides on the places in the expression to insert AddEntry 
calls. This is done using the usage statistics maintained 
by the cache server. The cache server keeps statistics for 
a sub-expression based on request history from clients. 
In particular, it records the number of times it has been 
looked up. In response to a cache lookup, this number
is included in the return value. We insert an AddEntry 
call for an expression only when the number of lookups 
on it exceeds a predefined threshold. 


Second, the decision made by the rewriter may still be 
wrong because of the lack of information about the saving
of the computation. Information such as execution time and
disk consumption is only available at run time.
So the final insertion decision is made based on the run- 
time information of the execution of the sub-expression. 
Currently, we use a simple benefit function that is propor- 
tional to the execution time and inversely proportional to 
storage overhead. We add the cache entry when the ben- 
efit exceeds a threshold. 
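
The two-phase decision can be summarized in a short sketch. The text states only that the benefit is proportional to execution time and inversely proportional to storage overhead, so the proportionality constant `k`, the thresholds, and all function names below are assumptions for illustration.

```python
def rewriter_wants_entry(lookup_count: int, lookup_threshold: int) -> bool:
    """Phase 1 (rewrite time): insert AddEntry only for frequently looked-up expressions."""
    return lookup_count > lookup_threshold


def benefit(execution_time: float, storage_overhead: float, k: float = 1.0) -> float:
    """Phase 2 metric: grows with execution time, shrinks with storage overhead."""
    if storage_overhead <= 0:
        raise ValueError("storage_overhead must be positive")
    return k * execution_time / storage_overhead


def keep_entry(execution_time: float, storage_overhead: float,
               benefit_threshold: float, k: float = 1.0) -> bool:
    """Phase 2 (run time): add the cache entry only if the benefit exceeds the threshold."""
    return benefit(execution_time, storage_overhead, k) > benefit_threshold


# Example: 120 s of cumulative execution time stored in 4 GB versus
# 1 s of execution time stored in 100 GB, with a threshold of 10.
expensive_small = keep_entry(120.0, 4.0, benefit_threshold=10.0)
cheap_large = keep_entry(1.0, 100.0, benefit_threshold=10.0)
```

A result that was costly to compute but small to store clears the threshold, while a cheap but bulky result does not. Lowering `benefit_threshold` when storage is plentiful would model the more aggressive caching described next.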


We also make our cache insertion policy adaptive to 
storage space pressure. When there is no pressure, we 
choose to cache more aggressively as long as it saves 
machine time. This strategy could increase the useless 
cache entries in the cache. But it is not a problem because 
it is addressed by Nectar’s garbage collection, discussed 
further below. 


3.2 Managing Derived Data


Derived datasets can take up a significant amount of stor- 
age space in a datacenter, and a large portion of it could 
be unused or seldom used. Nectar keeps track of the us- 
age statistics of all derived datasets and deletes the ones 
of the least value. Recall that Nectar permanently stores
the program of every derived dataset so that a deleted
derived dataset can be recreated by re-running its program.


Data Store for Derived Data 

As mentioned before, Nectar stores all derived 
datasets in a data store inside a distributed, fault-tolerant 
file system. The actual location of a derived dataset is 
completely opaque to programmers. Accessing an ex- 
isting derived dataset must go through the cache server. 
We expose a standard file interface with one important 
restriction: New derived datasets can only be created as 
results of computations. 


P = q.ToTable(“lenin/foo.pt”) 

Figure 5: The creation of a derived dataset. The actual
dataset is stored in the Nectar data store. The user file
contains only the primary key of the cache entry associated
with the derived dataset.


Our scheme to achieve this is straightforward. Fig- 
ure 5 shows the flow of creating a derived dataset by a 
computation and the relationship between the user file 
and the actual derived dataset. In the figure, P is a user 
program that writes its output to lenin/foo.pt. Af- 
ter applying transformations by Nectar and DryadLINQ, 
it is executed in the datacenter by Dryad. When the ex- 
ecution succeeds, the actual derived dataset is stored in 
the data store with a unique name generated by Nectar. A 
cache entry is created with the fingerprint of the program 
(FP(P)) as the primary key and the unique name as a
field. The content of lenin/foo.pt just contains the 
primary key of the cache entry. 

To access lenin/foo.pt, Nectar simply uses FP(P) to look up
the cache to obtain the location of the actual derived
dataset (A31E4.pt). The fact that all accesses go through
the cache server allows us to keep track of the usage
history of every derived dataset and to implement automatic
garbage collection for derived datasets based on their usage
history.


Garbage Collection 

When the available disk space falls below a thresh- 
old, the system automatically deletes derived datasets 
that are considered to be least useful in the future. This 
is achieved by a combination of the Nectar cache server 
and garbage collector. 

A derived dataset is protected from garbage collection 
if it is referenced by any cache entry. So the first step
is to evict from the cache those entries that the cache
server determines to have the least value.

The cache server uses information stored in the cache 
entries to do a cost-benefit analysis to determine the
usefulness of the entries. For each cache entry, we keep
track of the size of the resulting derived dataset ($S$), the
elapsed time since it was last used ($\Delta T$), the number
of times ($N$) it has been used and the cumulative machine
time ($M$) of the computation that created it. The cache
server uses these values to compute the cost-to-benefit
ratio


$CBRatio = (S \times \Delta T)/(N \times M)$


of each cache entry and deletes entries that have the 
largest ratios so that the cumulative space saving reaches 
a predefined threshold. 

Freshly created cache entries do not contain informa- 
tion for us to compute a useful cost/benefit ratio. To give 
them a chance to demonstrate their usefulness, we ex- 
clude them from deletion by using a lease on each newly 
created cache entry. 
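
A minimal sketch of this eviction policy follows. The field names, the treatment of never-used entries (an infinite ratio), and the function names are assumptions made for illustration; the text specifies only the ratio, the space-saving threshold, and the lease on new entries.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    name: str
    size: float            # S: size of the derived dataset
    idle_time: float       # Delta T: time since last use
    uses: int              # N: number of times used
    machine_time: float    # M: cumulative machine time to create it
    lease_expiry: float = 0.0


def cb_ratio(e: Entry) -> float:
    """Cost-to-benefit ratio (S x Delta T) / (N x M); larger means less valuable."""
    denominator = e.uses * e.machine_time
    if denominator == 0:
        return float("inf")     # assumption: unused or free-to-make entries go first
    return (e.size * e.idle_time) / denominator


def select_evictions(entries: List[Entry], space_target: float, now: float) -> List[Entry]:
    """Evict highest-ratio entries until the space target is met, skipping leased ones."""
    candidates = [e for e in entries if e.lease_expiry <= now]
    candidates.sort(key=cb_ratio, reverse=True)
    evicted, freed = [], 0.0
    for e in candidates:
        if freed >= space_target:
            break
        evicted.append(e)
        freed += e.size
    return evicted


entries = [
    Entry("a", size=100, idle_time=10, uses=1, machine_time=10),     # ratio 100
    Entry("b", size=10, idle_time=1, uses=10, machine_time=100),     # ratio 0.01
    Entry("c", size=50, idle_time=30, uses=2, machine_time=5),       # ratio 150
    Entry("d", size=500, idle_time=100, uses=1, machine_time=1, lease_expiry=10.0),
]
evicted = select_evictions(entries, space_target=120, now=0.0)
```

Entry `d` has by far the worst ratio but is still under lease, so it is not considered; `c` and `a` are evicted, which frees 150 units and meets the target.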

The entire cache eviction operation is done in the 
background, concurrently with any other cache server 
operations. When the cache server completes its evic- 
tion, the garbage collector deletes all derived datasets 
not protected by a cache entry using a simple mark-and- 
sweep algorithm. Again, this is done in the background, 
concurrently with any other activities in the system. 

Other operations can run concurrently with the 
garbage collector and create new cache entries and de- 
rived datasets. Derived datasets pointed to by cache en- 
tries (freshly created or otherwise) are not candidates for 
garbage collection. Notice however that freshly created 
derived datasets, which due to concurrency may not yet 
have a cache entry, also need to be protected from garbage
collection. We do this with a lease on the dataset. 

With these leases in place, garbage collection is quite 
straightforward. We first compute the set of all derived 
datasets (ignoring the ones with unexpired leases) in our 
data store, exclude from it the set of all derived datasets 
referenced by cache entries, and treat the remaining as 
garbage. 
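
The set computation described above can be written down directly. This sketch assumes the data store is summarized as a mapping from dataset name to lease expiry time and that cache entries are summarized by the dataset names they reference; both representations and the function name are hypothetical.

```python
from typing import Dict, Iterable, Set


def find_garbage(datasets: Dict[str, float], referenced: Iterable[str], now: float) -> Set[str]:
    """Return derived datasets that are neither leased nor referenced by a cache entry."""
    # Sweep candidates: everything in the data store whose lease has expired.
    unleased = {name for name, expiry in datasets.items() if expiry <= now}
    # Mark: datasets reachable from some cache entry are live.
    live = set(referenced)
    return unleased - live


datasets = {
    "A31E4.pt": 0.0,      # old dataset, still referenced by a cache entry
    "B7C20.pt": 0.0,      # old dataset whose cache entry was evicted
    "fresh.pt": 100.0,    # just written; its cache entry may not exist yet
}
garbage = find_garbage(datasets, referenced=["A31E4.pt"], now=50.0)
```

The lease is what keeps `fresh.pt` alive during the window in which the dataset exists but its cache entry has not yet been inserted.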


Our system could mistakenly delete datasets that are 
subsequently requested, but these can be recreated by re- 
executing the appropriate program(s) from the program 
store. Programs are stored in binary form in the pro- 
gram store. A program is a complete Dryad job that can 
be submitted to the datacenter for execution. In particu- 
lar, it includes the execution plan and all the application 
DLLs. We exclude all system DLLs, assuming that they 
are available on the datacenter machines. For a typical 
datacenter that runs 1000 jobs daily, our experience
suggests it would take less than 1TB to store one year's
programs (excluding system DLLs) in uncompressed form. With
compression, it should take up roughly a few hundred
gigabytes of disk space, which is negligible even for a
small datacenter.


4 Experimental Evaluation 


We evaluate Nectar running on our 240-node research cluster
and present analytic results of execution logs from 25
large production clusters that run jobs similar to those on
our research cluster. We first present our analytic results.


4.1 Production Clusters 


We use logs from 25 different clusters to evaluate the 
usefulness of Nectar. The logs consist of detailed execu- 
tion statistics for 33182 jobs in these clusters for a recent 
3-month period. For each job, the log has the source pro- 
gram and execution statistics such as computation time, 
bytes read and written and the actual time taken for ev- 
ery stage in a job. The log also gives information on the 
submission time, start time, end time, user information, 
and job status. 

Programs from the production clusters work with mas- 
sive datasets such as click logs and search logs. Programs 
are written in a language similar to DryadLINQ in that 
each program is a sequence of SQL-like queries [6]. A 
program is compiled into an expression tree with various 
stages and modeled as a DAG with vertices representing 
processes and edges representing data flows. The DAGs 
are executed on a Dryad cluster, just as in our Nectar 
managed cluster. Input data in these clusters is stored as 
append-only streams. 


Benefits from Caching 

We parse the execution logs to recreate a set of DAGs, 
one for each job. The root of the DAG represents the 
input to the job and a path through the DAG starting at 
the root represents a partial (i.e., a sub-) computation of 
the job. Identical DAGs from different jobs represent an 
opportunity to save part of the computation time of a later 
job by caching results from the earlier ones. We simulate 
the effect of Nectar’s caching on these DAGs to estimate
cache hits. 

Our results show that on average across all clusters, 
more than 35% of the jobs could benefit from caching. 
More than 30% of programs in 18 out of 25 clusters 
could have at least one cache hit, and there were even 
some clusters where 65% of programs could have cache 
hits. 

The log contains detailed computation time informa- 
tion for each node in the DAG for a job. When there is 
a cache hit on a sub-computation of a job, we can there- 
fore calculate the time saved by the cache hit. We show 
the result of this analysis in two different ways: Figure 6 
shows the percentage of computing time saved and Ta- 
ble 1 shows the minimum number of hours of computa- 
tion saved in each cluster. 

Figure 6 shows that a significant percentage of computation
time can be saved in each cluster with Nectar. Most
clusters can save a minimum of 20% to 40% of com- 
putation time and in some clusters the savings are up to 
50%. Also, as an example, Table 1 shows a minimum of 
7143 hours of computation per day can be saved using 
Nectar in Cluster C5. This is roughly equivalent to say- 
ing that about 300 machines in that cluster were doing 
wasteful computations all day that caching could eliminate.
Across all 25 clusters, 35078 hours of computation per day
can be saved, which is roughly equivalent to saving 1461
machines.
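
The machine equivalents quoted above are consistent with dividing the saved computation hours per day by 24, assuming one machine contributes 24 machine-hours per day (an assumption about how the figures were derived, not something the text states):

```latex
\[
\frac{7143\ \text{hours/day}}{24\ \text{hours/day per machine}} \approx 298 \approx 300\ \text{machines},
\qquad
\frac{35078\ \text{hours/day}}{24\ \text{hours/day per machine}} \approx 1461\ \text{machines}.
\]
```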

Figure 6: Fraction of compute time saved in each cluster


Ease of Program Development

Our analysis of the caching accounted for both sub- 
computation as well as incremental/sliding window hits. 
We noticed that the percentage of sliding window hits in 
some production clusters was minimal (under 5%). We 
investigated this further and noticed that many program- 
mers explicitly structured their programs so that they can 
reuse a previous computation. This somewhat artificial 
structure makes their programs cumbersome, which can 
be alleviated by using Nectar. 

Cluster | Computation Time Saved | Cluster | Computation Time Saved
        | (hours/day)            |         | (hours/day)
C1      | 3898                   | C14     | 753
C2      | 2276                   | C15     | 755
C3      | 977                    | C16     | 2259
C4      | 1345                   | C17     | 3385
C5      | 7143                   | C18     | 528
C6      | 62                     | C19     | 4
C7      | 57                     | C20     | 415
C8      | 590                    | C21     | 606
C9      | 763                    | C22     | 2002
C10     | 2457                   | C23     | 1316
C11     | 1924                   | C24     | 291
C12     | 368                    | C25     | 58
C13     | 105                    |         |

Table 1: Minimum Computation Time Savings


There are anecdotes of system administrators manu- 
ally running a common sub-expression on the daily input 
and explicitly notifying programmers to avoid each pro- 
gram performing the computation on its own and tying 
up cluster resources. Nectar automatically supports
incremental computation, so programmers do not need to code
it explicitly. As discussed in Section 2, Nectar tries to
produce the best possible query plan using the cached
results, significantly reducing computation time while
remaining transparent to the user.


An unanticipated benefit of Nectar reported by our 
users on the research cluster was that it aids in debugging 
during program development. Programmers incremen- 
tally test and debug pieces of their code. With Nectar the 
debugging time significantly improved due to cache hits. 
We quantify the effect of this on the production clusters. 
We assumed that a program is a debugged version of an- 
other program if they had almost the same queries ac- 
cessing the same source data and writing the same de- 
rived data, submitted by the same user and had the same 
program name. 


Table 2 shows the amount of debugging time that can
be saved by Nectar in the 90-day period. We present
results for the first 12 clusters due to space constraints.
Again, these are conservative estimates but show
substantial savings. For instance, in Cluster C1, a minimum
of 3 hours of debugging time can be saved per day. No- 
tice that this is actual elapsed time, i.e., each day 3 hours 
of computation on the cluster spent on debugging pro- 
grams can be avoided with Nectar. 

Cluster | Debugging Time Saved | Cluster | Debugging Time Saved
        | (hours)              |         | (hours)
C1      | 270                  | C7      | 3
C2      | 211                  | C8      | 35
C3      | 24                   | C9      | 84
C4      | 101                  | C10     | 183
C5      | 94                   | C11     | 121
C6      | 8                    | C12     | 49

Table 2: Actual elapsed time saved on debugging in 90
days.


Managing Storage 

Today, in datacenters, storage is manually managed.¹
We studied storage statistics in our 240-node research 
cluster that has been used by a significant number of 
users over the last 2 to 3 years. We crawled this clus- 
ter for derived objects and noted their last access times. 
Of the 109 TB of derived data, we discovered that about 
50% (54.5 TB) was never accessed in the last 250 days. 
This shows that users often create derived datasets and 
after a point, forget about them, leaving them occupying 
unnecessary storage space. 

We analyzed the production logs for the amount of de- 
rived datasets written. When calculating the storage oc- 
cupied by these datasets, we assumed that if a new job 
writes to the same dataset as an old job, the dataset is 
overwritten. Figure 7 shows the growth of derived data 
storage in cluster C1. It shows an approximately linear
growth with the total storage occupied by datasets cre- 
ated in 90 days being 670 TB. 


Figure 7: Growth of storage occupied by derived datasets
in Cluster C1


¹Nectar’s motivation in automatically managing storage partly
stems from the fact that we used to get periodic e-mail messages from 
the administrators of the production clusters requesting us to delete our 
derived objects to ease storage pressure in the cluster. 








Cluster | Projected unreferenced
        | derived data (in TB)
C1      | 2712
C5      | 368
C8      | 863
C13     | 995
C15     | 210

Table 3: Projected unreferenced data in 5 production
clusters


Assuming similar trends in data access time in our lo- 
cal cluster and on the production clusters, Table 3 shows 
the projected space occupied by unreferenced derived 
datasets in 5 production clusters that showed a growth 
similar to cluster C1. Any object that has not been refer- 
enced in 250 days is deemed unreferenced. This result is 
obtained by extrapolating the amount of data written by 
jobs in 90 days to 2 years based on the storage growth 
curve and predicting that 50% of that storage will not be 
accessed in the last 250 days (based on the result from 
our local cluster). As we see, production clusters cre- 
ate a large amount of derived data, which if not properly 
managed can create significant storage pressure. 


4.2 System Deployment Experience 


Each machine in our 240-node research cluster has two 
dual-core 2.6GHz AMD Opteron 2218 HE CPUs, 16GB 
RAM, four 750GB SATA drives, and runs the Windows Server
2003 operating system. We evaluate the comparative
performance of several programs with Nectar turned on 
and off. 

We use three datasets to evaluate the performance of 
Nectar: 

WordDoc Dataset. The first dataset is a collection of 
Web documents. Each document contains a URL and its 
content (as a list of words). The data size is 987.4 GB.
The dataset is randomly partitioned into 236 partitions.
Each partition has two replicas in the distributed file sys- 
tem, evenly distributed on 240 machines. 

ClickLog Dataset. The second dataset is a small sam- 
ple from an anonymized click log of a commercial search 
engine collected over five consecutive days. The dataset 
is 160GB in size, randomly partitioned into 800 parti- 
tions, two replicas each, evenly distributed on 240 ma- 
chines. 

SkyServer Dataset. This database is taken from the 
Sloan Digital Sky Survey database [11]. It contains two 
data files: 11.8 and 41.8 GBytes of data. Both files were 
manually range-partitioned into 40 partitions using the 
same keys. 


Sub-computation Evaluation

We have four programs: WordAnalysis, TopWord, 
MostDoc, and TopRatio that analyze the WordDoc 
dataset. 

WordAnalysis parses the dataset to generate the num- 
ber of occurrences of each word and the number of doc- 
uments that it appears in. TopWord looks for the top ten 
most commonly used words in all documents. MostDoc 
looks for the top ten words appearing in the largest number
of documents. TopRatio finds the percentage of occurrences
of the ten most used words among all words. All programs
take the entire 987.4 GB dataset as
input. 











Program Name | Cumulative Time        | Saving
             | Nectar on | Nectar off |
TopWord      | 16.1m     | 21h44m     | 98.8%
MostDoc      | 17.5m     | 21h46m     | 98.6%
TopRatio     | 21.2m     | 43h30m     | 99.2%

Table 4: Saving by sharing a common sub-computation:
Document analysis


With Nectar on, we can cache the results of executing 
the first program, which spends a huge amount of com- 
putation analyzing the list of documents to output an ag- 
gregated result of much smaller size (12.7 GB). The sub- 
sequent three programs share a sub-computation with the 
first program, which is satisfied from the cache. Table 4 
shows the cumulative CPU time saved for the three pro-
grams. This behavior is not isolated: one of the programs
that uses the ClickLog dataset shows a similar pattern; we
do not report the results here for reasons of space.
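The Saving column of Table 4 can be rechecked from the two time columns. The following is a minimal sketch, assuming the saving is defined as 1 - (time with Nectar on) / (time with Nectar off); the helper names `to_minutes` and `saving` are ours and are not part of Nectar. The recomputed values agree with the table to within rounding of the reported times.

```python
def to_minutes(t: str) -> float:
    """Convert a duration such as '16.1m' or '21h44m' to minutes."""
    hours, _, rest = t.rpartition("h")      # ('', '', '16.1m') when there is no 'h'
    minutes = rest.rstrip("m")
    return (float(hours) if hours else 0.0) * 60 + (float(minutes) if minutes else 0.0)


def saving(nectar_on: str, nectar_off: str) -> float:
    """Fraction of cumulative time saved (assumed definition: 1 - on/off)."""
    return 1.0 - to_minutes(nectar_on) / to_minutes(nectar_off)


# Times copied from Table 4.
TABLE4 = {
    "TopWord": ("16.1m", "21h44m"),
    "MostDoc": ("17.5m", "21h46m"),
    "TopRatio": ("21.2m", "43h30m"),
}

for name, (on, off) in TABLE4.items():
    print(f"{name}: {saving(on, off):.2%}")
```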


Incremental Computation 

We describe the performance of a program that stud- 
ies query relevance by processing the ClickLog dataset. 
When users search a phrase at a search engine, they click 
the most relevant URLs returned in the search results. 
Monitoring the URLs that are clicked the most for each 
search phrase is important to understand query relevance. 
The input to the query relevance program is the set of all 
click logs collected so far, which increases each day, be- 
cause a new log is appended daily to the dataset. This 
program is an example where the initial dataset is large, 
but the incremental updates are small. 

Table 5 shows the cumulative CPU time with Nectar 
on and off, the size of datasets and incremental updates 
each day. We see that the total size of input data increases 
each day, while the computation resources used daily in-
crease much more slowly when Nectar is on. We observed
similar performance results for another program that cal- 
culates the number of active users, who are those that 
clicked at least one search result in the past three days. 
These results are not reported here for reasons of space. 
        Data Size (GB)       Time (m)           Saving
        Total     Update     On       Off
Day3    68.20     40.50      93.0     107.5     13.49%
Day4    111.25    43.05      112.9    194.0     41.80%
Day5    152.19    40.94      164.6    325.8     49.66%

Table 5: Cumulative machine time savings for incremen-
tal computation.
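The append-only pattern described above can be illustrated with a small sketch. It is not Nectar's implementation: it assumes that each day's log is an immutable partition, that per-partition click counts can be cached under a fingerprint of the partition's content, and that the final answer is a cheap merge of the cached partial results. The class and function names (`PartialResultCache`, `top_url_per_query`) are hypothetical.

```python
import hashlib
from collections import Counter


class PartialResultCache:
    """Caches per-partition click counts, keyed by a content fingerprint."""

    def __init__(self):
        self._store = {}
        self.misses = 0          # number of partitions actually computed

    @staticmethod
    def _fingerprint(records):
        h = hashlib.sha256()
        for query, url in records:
            h.update(f"{query}\t{url}\n".encode())
        return h.hexdigest()

    def click_counts(self, records):
        key = self._fingerprint(records)
        if key not in self._store:
            self.misses += 1
            self._store[key] = Counter(records)   # counts each (query, url) pair
        return self._store[key]


def top_url_per_query(daily_logs, cache):
    """Merge cached per-day counts and pick the most-clicked URL per query."""
    total = Counter()
    for day in daily_logs:
        total.update(cache.click_counts(day))
    best = {}
    for (query, url), n in total.items():
        if query not in best or n > best[query][1]:
            best[query] = (url, n)
    return {query: url for query, (url, _) in best.items()}
```

When a new day's log is appended, only that partition misses the cache; the earlier days are served from stored partial results, which is the reason the daily cost grows with the update size rather than with the total input size.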


Debugging Experience: Sky Server 

Here we demonstrate how Nectar saves program de- 
velopment time by shortening the debugging cycle. We 
select the most time-consuming query (Q18) from the 
Sloan Digital Sky Survey database [11]. The query iden- 
tifies a gravitational lens effect by comparing the loca- 
tions and colors of stars in a large astronomical table, 
using a three-way Join over two input tables contain- 
ing 11.8 GBytes and 41.8 GBytes of data, respectively. 
The query is composed of four steps, each of which is 
debugged separately. When debugging the query, the 
first step failed and the programmer modified the code. 
Within a couple of tries, the first step succeeded, and ex- 
ecution continued to the second step, which failed, and 
so on. 

Table 6 shows the average savings in cumulative time 
as each step is successively debugged with Nectar. To-
wards the end of the program, Nectar saves as much as 88%
of the time.




















             Cumulative Time            Saving
             Nectar on    Nectar off
Step 1       47.4m        47.4m         0%
Steps 1-2    26.5m        62.5m         58%
Steps 1-3    35.5m        122.7m        71%
Steps 1-4    15.0m        129.3m        88%

Table 6: Debugging: SkyServer cumulative time


5 Related Work 


Our overall system architecture is inspired by the Vesta 
system [15]. Many high-level concepts and techniques 
(e.g., the notion of primary and derived data) are directly 
taken from Vesta. However, because of the difference in 
application domains, the actual design and implementa- 
tion of the main system components such as caching and 
program rewriting are radically different. 

Many aspects of query rewriting and caching in our 
work are closely related to incremental view mainte- 
nance and materialized views in the database litera- 
ture [2, 5, 13, 19]. However, there are some important 
differences as discussed in Section 3.1. Also, we are not 
aware of the implementation of these ideas in systems 
at the scale we describe in this paper. Incremental view
maintenance is concerned with the problem of updating
the materialized views incrementally (and consistently)
when database tables are subjected to random updates.
Nectar is simpler in that we only consider append-only 
updates. On the other hand, Nectar is more challenging 
because we must deal with user-defined functions written 
in a general-purpose programming language. Many of 
the sophisticated view reuses given in [13] require anal- 
ysis of the SQL expressions that is difficult to do in the 
presence of user-defined functions, which are common 
in our environment. 

With the wide adoption of distributed execution 
platforms like Dryad/DryadLINQ, MapReduce/Sawzall, 
Hadoop/Pig [18, 29, 7, 25, 12, 24], recent work has in- 
vestigated job patterns and resource utilization in data 
centers [1, 14, 22, 23, 26]. These investigations of real
workloads have revealed a vast amount of wastage in
datacenters due to redundant computations, which is 
consistent with our findings from logs of a number of 
production clusters. 

DryadInc [26] represented our early attempt to elim- 
inate redundant computations via caching, even before 
we started on the DryadLINQ project. The caching ap- 
proach is quite similar to Nectar. However, it works at 
the level of the Dryad dataflow graph, which is too general
and too low-level for the system we wanted to build. 

The two systems that are most related to Nectar are the 
stateful bulk processing system described by Logothetis 
et al. [22] and Comet [14]. These systems mainly fo- 
cus on addressing the important problem of incremental 
computation, which is also one of the problems Nectar 
is designed to address. However, Nectar is a much more 
ambitious system, attempting to provide a comprehen- 
sive solution to the problem of automatic management 
of data and computation in a datacenter. 

As a design principle, Nectar is designed to be trans- 
parent to the users. The stateful bulk processing sys- 
tem takes a different approach by introducing new prim- 
itives and hence makes state explicit in the programming 
model. It would be interesting to understand the trade- 
offs in terms of performance and ease of programming. 

Comet, also built on top of Dryad and DryadLINQ, 
also attempted to address the sub-computation problem 
by co-scheduling multiple programs with common sub- 
computations to execute together. There are two inter- 
esting issues raised by the paper. First, when multiple 
programs are involved in caching, it is difficult to de- 
termine if two code segments from different programs 
are identical. This is particularly hard in the presence
of user-defined functions, which are very common in the
kind of DryadLINQ programs targeted by both Comet 
and Nectar. It is unclear how this determination is made 
in Comet. Nectar addresses this problem by building a
sophisticated static program analyzer that allows us to
compute the dependency of user-defined code. Second, 
co-scheduling in Comet requires submissions of multi- 
ple programs with the same timestamp. It is therefore 
not useful in all scenarios. Nectar instead shares sub- 
computations across multiple jobs executed at different 
times by using a datacenter-wide, persistent cache ser- 
vice. 

Caching function calls in a functional programming 
language is well studied in the literature [15, 21, 27]. 
Memoization avoids re-computing the same function 
calls by caching the result of past invocations. Caching 
in Nectar can be viewed as function caching in the con- 
text of large-scale distributed computing. 


6 Discussion and Conclusions 


In this paper, we described Nectar, a system that auto- 
mates the management of data and computation in dat- 
acenters. The system has been deployed on a 240-node 
research cluster, and has been in use by a small number 
of developers. Feedback has been quite positive. One 
very popular comment from our users is that the system 
makes program debugging much more interactive and 
fun. Most of us, the Nectar developers, use Nectar to
develop Nectar on a daily basis, and have found a big
increase in our productivity.

To validate the effectiveness of Nectar, we performed 
a systematic analysis of computation logs from 25 pro- 
duction clusters. As reported in Section 4, we have seen 
huge potential value in using Nectar to manage the com- 
putation and data in a large datacenter. Our next step is 
to work on transferring Nectar to Microsoft production 
datacenters. 

Nectar is a complex distributed system with multi-
ple interacting policies. Devising the right policies and
fine-tuning their parameters to find the right trade-offs is 
essential to make the system work in practice. Our eval- 
uation of these tradeoffs has been limited, but we are ac- 
tively working on this topic. We hope we will continue to 
learn a great deal with the ongoing deployment of Nectar 
on our 240-node research cluster. 

One aspect of Nectar that we have not explored is that 
it maintains the provenance of all the derived datasets 
in the datacenter. Many important questions about data 
provenance could be answered by querying the Nectar 
cache service. We plan to investigate this further in future 
work. 

What Nectar essentially does is to unify computation 
and data, treating them interchangeably by maintaining 
the dependency between them. This allows us to greatly 
improve the datacenter management and resource utiliza- 
tion. We believe that it represents a significant step for- 
ward in automating datacenter computing. 
Intrusion Recovery Using Selective Re-execution

ABSTRACT


RETRO repairs a desktop or server after an adversary com- 
promises it, by undoing the adversary’s changes while 
preserving legitimate user actions, with minimal user in- 
volvement. During normal operation, RETRO records 
an action history graph, which is a detailed dependency 
graph describing the system’s execution. RETRO uses re-
finement to describe graph objects and actions at multiple
levels of abstraction, which allows for precise dependen- 
cies. During repair, RETRO uses the action history graph 
to undo an unwanted action and its indirect effects by 
first rolling back its direct effects, and then re-executing 
legitimate actions that were influenced by that change. 
To minimize user involvement and re-execution, RETRO 
uses predicates to selectively re-execute only actions that 
were semantically affected by the adversary’s changes, 
and uses compensating actions to handle external effects. 

An evaluation of a prototype of RETRO for Linux with 
2 real-world attacks, 2 synthesized challenge attacks, and 
6 attacks from previous work, shows that RETRO can 
often repair the system without user involvement, and 
avoids false positives and negatives from previous so-
lutions. These benefits come at the cost of 35-127% in
execution time overhead and of 4-150 GB of log space per
day, depending on the workload. For example, a HotCRP 
paper submission web site incurs 35% slowdown and gen- 
erates 4 GB of logs per day under the workload from 30 
minutes prior to the SOSP 2007 deadline. 


1 INTRODUCTION 


Despite our best efforts to build secure computer systems, 
intrusions are nearly unavoidable in practice. When faced 
with an intrusion, a user is typically forced to reinstall 
their system from scratch, and to manually recover any 
documents and settings they might have had. Even if the 
user diligently makes a complete backup of their system 
every day, recovering from the attack requires rolling back 
to the most recent backup before the attack, thereby losing 
any changes made since then. Since many adversaries go 
to great lengths to prevent the compromise from being 
discovered, it can take days or weeks for a user to discover 
that their machine has been broken into, resulting in a loss 
of all user work from that period of time. 

This paper presents RETRO, a system for retroactively 
undoing past attacks and their indirect effects on a single 
machine. With RETRO, an administrator specifies offend-
ing actions from the past, such as a TCP connection or
an HTTP request from an adversary, that they want to 
undo. RETRO then repairs the system’s state (the file sys- 
tem) by selectively undoing the offending actions—that 
is, constructing a new system state, as if the offending 
actions never took place, but all legitimate actions re- 
mained. Thus, by selectively undoing the adversary’s 
changes while preserving user data, RETRO makes intru- 
sion recovery more practical. 

To illustrate the challenges facing RETRO, consider the 
following attack, which we will use as a running example 
in this paper. Eve, an evil adversary, compromises a Linux 
machine, and obtains a root shell. To mask her trail, she 
removes the last hour’s entries from the system log. She 
then creates several backdoors into the system, including 
a new account for eve, and a PHP script that allows her to 
execute arbitrary commands via HTTP. Eve then uses one 
of these backdoors to download and install a botnet client. 
To ensure continued control of the machine, Eve adds a 
line to the /usr/bin/texi2pdf shell script (a wrapper 
for LaTeX) to restart her bot. In the meantime, legitimate
users log in, invoke their own PHP scripts, use texi2pdf, 
and root adds new legitimate users. 

To undo attacks, RETRO provides a system-wide ar- 
chitecture for recording actions, causes, and effects in 
order to identify all the downstream effects of a compro- 
mise. The key challenge is that a compromise in the past 
may have effects on subsequent legitimate actions, espe- 
cially if the administrator discovers an attack long after it 
occurred. RETRO must sort out this entanglement auto- 
matically and efficiently. In our running example, Eve’s 
changes to the password file and to texi2pdf are entan- 
gled with legitimate actions that modified or accessed the 
password file, or used texi2pdf. If legitimate users ran 
texi2pdf, their output depended on Eve’s actions, and 
so did any programs that used that output in turn. 

As described in §2, most previous systems require user 
input to disentangle such actions. Typical previous solu- 
tions are good at detecting a compromise and allow a user 
to roll the system back to a checkpoint before the com-
promise, but then ask the user to incorporate legitimate
changes from after the compromise manually; this can 
be quite onerous if the attack has happened a long time 
ago. Some solutions reduce the amount of manual work 
for special cases (e.g., known viruses). The most recent 
general solution for reducing user assistance (Taser [17]) 
incurs many false positives (undoing legitimate actions), 
or, after white-listing some actions to minimize false posi-
tives, it incurs false negatives (missing parts of the attack). 

How can RETRO disentangle unwanted actions from le- 
gitimate operations, and undo all effects of the adversary’s 
actions that happened in the past, while preserving every 
legitimate action? RETRO addresses these challenges with 
four ideas: 

First, RETRO models the entire system using a new 
form of a dependency graph, which we call an action his- 
tory graph. Like any dependency graph, the action history 
graph represents objects in the system (such as files and 
processes), and the dependencies between those objects 
(corresponding to actions such as a process reading a file). 
To record precise dependencies, the action history graph 
supports refinement, that is, representing the same object 
or action at multiple levels of abstraction. For example, 
a directory inode can be refined to expose individual file 
names in that directory, and a process can be refined into 
function calls and system calls. The action history graph 
also captures the semantics of each dependency (e.g., the 
arguments and return values of an action). 

Second, RETRO re-executes actions in the graph, such 
as system calls or process invocations, that were influ- 
enced by the offending changes. For example, undoing 
undesirable actions may indirectly change the inputs of 
later actions, and thus these actions must be re-executed 
with their repaired inputs. 

Third, RETRO uses predicates to do selective re- 
execution of just the actions whose dependencies are 
semantically different after repair, thereby minimizing 
cascading re-execution. For example, if Eve modified 
some file, and that file was later read by process P, we 
may be able to avoid re-executing P if the part of the file 
accessed by P is the same before and after repair. 

Finally, to selectively re-execute existing applications, 
RETRO uses shepherded re-execution to monitor the re- 
execution of processes (§5.2.3), and stops re-execution 
when the process state converges to the original execution 
(such as when a process issues an identical exec call). 

Using a prototype of RETRO for Linux, we show that 
RETRO can recover from both real-world and synthetic 
attacks, including our running example, while preserving 
legitimate user changes. Out of ten experiment scenarios, 
six required no user input to repair, two required user 
confirmation that a conflicting login session belonged to 
the attacker, and two required the user to manually redo 
affected operations. We also show that RETRO’s ideas of 
refinement, shepherded re-execution, and predicates are 
key to repairing precisely the files affected by the attack, 
and to minimizing user involvement. A performance eval- 
uation shows that, for extreme workloads that issue many 
system calls (such as continuously recompiling the Linux 
kernel), RETRO imposes an 89-127% runtime overhead
and requires 100-150 GB of log space per day. For a
more realistic application, such as a HotCRP [23] confer-
ence submission site, these costs are 35% and 4 GB per
day, respectively. RETRO’s runtime cost can be reduced 
by using additional cores, amounting to 0% for HotCRP 
when one core is dedicated to RETRO. 

The rest of the paper is organized as follows. The next 
section compares RETRO with related work. §3 presents 
an overview of RETRO’s architecture and workflow. §4
discusses RETRO’s action history graph in detail, and
§5 describes RETRO’s repair managers. Our prototype
implementation is described in §6, and §7 evaluates the
effectiveness and performance of RETRO. Finally, §8 dis- 
cusses the limitations and future work, and §9 concludes. 


2 RELATED WORK 


This section relates RETRO to industrial and academic 
solutions for recovery after a compromise, and prior tech- 
niques that RETRO builds on. 


2.1 Repair solutions 


One line of industrial solutions is anti-virus tools, which 
can revert changes made by common malware, such as 
Windows registry keys and files comprising a known virus. 
For example, tools such as [34] can generate remediation 
procedures for a given piece of malware. While such 
techniques work for known malware that behaves in pre- 
dictable ways, they incur both false positives and false 
negatives, especially for new or unpredictable malware, 
and may not be able to recover from attacks where some 
information is lost, such as file deletions or overwrites. 
They also cannot repair changes that were a side-effect of 
the attack, such as changes made by a trojaned program, 
or changes made by an interactive adversary, whereas 
RETRO can undo such changes. 

Another line of industrial solutions is systems that help 
users roll back unwanted changes to system state. These 
solutions include Windows System Restore [18], Win- 
dows Driver Rollback [30], Time Machine [4], and numer- 
ous backup tools. These tools perform coarse-grained re- 
covery, and require the user to identify what files were af- 
fected. RETRO uses the action history graph to track down 
all effects of an attack, repairs precisely those changes, 
and repairs all side-effects of the attack, without requiring 
the user to guess what files were affected. 

A final line of popular solutions is using virtual ma- 
chines as a form of whole-system backup. Using Re- 
Virt [14] or Moka5 [11, 31], an administrator can roll 
back to a checkpoint before an attack, losing both the 
attacker’s changes and any legitimate changes since that 
point. One could imagine a system that replays recorded 
legitimate network packets to the virtual machine to re- 
apply legitimate changes. However, if there are even 
subtle dependencies between omitted and replayed pack- 
ets, the replayed packets will result in conflicts or external 
Figure 1: Overview of RETRO’s architecture, including major components and their interactions. Shading indicates components introduced by
RETRO. Striped shading of checkpoints indicates that RETRO reuses existing file system snapshots when available. 


dependencies, requiring user input to proceed. By record- 
ing dependencies and re-executing actions at many levels 
of abstraction using refinement, RETRO avoids such con- 
flicts and can preserve legitimate changes without user 
input. 

Academic research has tried to improve over the in- 
dustrial solutions by attempting to make solutions more 
automatic. Brown’s undoable email store [10] shows how 
an email server can recover from operator mistakes, by 
turning all operations into verbs, such as SMTP or IMAP 
commands. Unlike RETRO, Brown’s approach is limited 
to recovering from accidental operator mistakes. As a 
result, it cannot deal with an adversary that goes outside 
of the verb model and takes advantage of a vulnerability 
in the IMAP server software, or guesses root’s password 
to log in via ssh. Moreover, it cannot recover from actions 
that had system-wide effects spanning multiple applica- 
tions, files, and processes. 

The closest related work to RETRO is Taser [17], which 
uses taint tracking to find files affected by a past attack. 
Taser suffers from false positives, erroneously rolling back 
hundreds or thousands of files. To prevent false positives, 
Taser uses a white-list to ignore taint for some nodes or 
edges. This causes false negatives, so an attacker can 
bypass Taser altogether. While extensions of Taser catch 
some classes of attacks missed due to false negatives [40], 
RETRO has no need for white-listing. RETRO recovers 
from all attacks presented in the Taser paper with no 
false positives or false negatives. RETRO avoids Taser’s 
limitations by using a design based on the action history 
graph, and techniques such as predicates and re-execution, 
as opposed to Taser’s taint propagation. 

Polygraph [29] uses taint tracking to recover from com- 
promised devices in a data replication system, and incurs 
false positives like Taser. Unlike RETRO, Polygraph can 
recover from compromises in a distributed system. 


2.2 Related techniques 


The use of dependency information for security has been 
widely explored in many contexts, including informa-
tion flow control [25, 45], taint tracking [44], data prove-
nance [9], forensics [21], system integrity [8], and so
on. A key difference in RETRO’s action history graph 
is the use of exact dependency data to decide whether a 
dependency has semantically changed at repair time. 

RETRO assumes that intrusion detection and analysis 
tools, such as [7, 12, 14, 15, 19-22, 24, 40, 43], detect 
attacks and pinpoint attack edges. RETRO’s intrusion de- 
tection is based on BackTracker [21]. A difference is that 
RETRO’s action history graph records more information 
than BackTracker, which RETRO needs for repair (but 
doesn’t use yet for detection). 

Transactions [33, 36] help revert unwanted changes 
before commit, whereas RETRO can selectively undo 
“committed” actions. Database systems use compensating 
transactions to revert committed transactions, including 
malicious transactions [3, 27]; RETRO similarly uses com- 
pensating actions to deal with externally-visible changes. 


3 OVERVIEW 


RETRO consists of several components, as shown in Fig- 
ure 1. During normal execution, RETRO’s kernel module 
records a log of system execution, and creates periodic 
checkpoints of file system state. When the system ad- 
ministrator notices a problem, he or she uses RETRO to 
track down the initial intrusion point. Given an intrusion 
point, RETRO reverts the intrusion, and repairs the rest 
of the system state, relying on the system administrator 
to resolve any conflicts (e.g., both the adversary and a 
legitimate user modified the same line of the password 
file). The rest of this section describes these phases of 
operation in more detail, and outlines the assumptions 
made by RETRO about the system and the adversary. 


Normal execution. As the computer executes, RETRO 
must record sufficient information to be able to revert 
the effects of an attack. To this end, RETRO records 
periodic checkpoints of persistent state (the file system), 
so that it can later roll back to a checkpoint. RETRO 
does not require any specialized format for its file system 
checkpoints; if the file system already creates periodic
snapshots, such as [26, 32, 37, 38], RETRO can simply 
use these snapshots, and requires no checkpointing of its 
own. In addition to rollback, RETRO must be able to re- 
execute affected computations. To this end, RETRO logs 
actions executed over time, along with their dependencies. 
The resulting checkpoints and actions comprise RETRO’s 
action history graph, such as the one shown in Figure 2. 

The action history graph consists of two kinds of ob- 
jects: data objects, such as files, and actor objects, such 
as processes. Each object has a set of checkpoints, rep- 
resenting a copy of its state at different points in time. 
Each actor object additionally consists of a set of actions, 
representing the execution of that actor over some period 
of time. Each action has dependencies from and to other 
objects in the graph, representing the objects accessed 
and modified by that action. Actions and checkpoints of 
adjacent objects are ordered with respect to each other, in 
the order in which they occurred.¹
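The structure described above can be pictured with a few record types. This is a minimal sketch under our own naming (`Checkpoint`, `Action`, `GraphObject`, `Actor`, and integer timestamps are all assumptions made for illustration); it does not reproduce RETRO's on-disk log format or its refinement mechanism.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    time: int        # position in the global order of events
    state: bytes     # copy of the object's state at that time


@dataclass
class Action:
    time: int
    name: str
    reads: list = field(default_factory=list)    # names of objects the action depends on
    writes: list = field(default_factory=list)   # names of objects the action modifies


@dataclass
class GraphObject:
    """A data object such as a file; it carries only checkpoints."""
    name: str
    checkpoints: list = field(default_factory=list)

    def latest_checkpoint_before(self, time: int):
        """Return the most recent checkpoint strictly before `time`, or None."""
        earlier = [c for c in self.checkpoints if c.time < time]
        return max(earlier, key=lambda c: c.time) if earlier else None


@dataclass
class Actor(GraphObject):
    """An actor object such as a process; it additionally carries actions."""
    actions: list = field(default_factory=list)
```

For example, rolling a password file back to its state before an action at time 5 amounts to calling `latest_checkpoint_before(5)` on the corresponding `GraphObject`.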

RETRO stores the action history graph in a series of log 
files over time. When RETRO needs more space for new 
log files, it garbage-collects older log files (by deleting 
them). Log files are only useful to RETRO in conjunction 
with a checkpoint that precedes the log files, so log files 
with no preceding checkpoint can be garbage-collected. 
In practice, this means that RETRO keeps checkpoints 
for at least as long as the log files. By design, RETRO 
cannot recover from an intrusion whose log files have 
been garbage collected; thus, the amount of log space 
allocated to logs and checkpoints controls RETRO’s re- 
covery “horizon”. For example, a web server running the 
HotCRP paper review software [23] logs 4 GB of data per 
day, so if the administrator dedicates a 2 TB disk ($100) 
to RETRO, he or she can recover from attacks within the 
past year, although these numbers strongly depend on the 
application. 
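The recovery horizon in this example is simple arithmetic: available log space divided by the daily logging rate. The sketch below assumes decimal units (2 TB = 2000 GB) and ignores the space taken by checkpoints, so it gives an upper bound; the function name is ours.

```python
def recovery_horizon_days(log_space_gb: float, log_rate_gb_per_day: float) -> float:
    """Upper bound on how far back repair can reach, ignoring checkpoint space."""
    return log_space_gb / log_rate_gb_per_day


# HotCRP workload: 4 GB of logs per day on a 2 TB disk.
print(recovery_horizon_days(2000, 4))      # 500.0 days, i.e. more than a year

# Worst case reported for the kernel-recompilation workload: 150 GB per day.
print(recovery_horizon_days(2000, 150))    # roughly 13 days
```

The second call shows why the text cautions that the horizon depends strongly on the application: the same disk covers more than a year of HotCRP activity but only about two weeks of a system-call-heavy workload.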


Intrusion detection. At some point after an adversary 
compromises the system, the system administrator learns 
of the intrusion, perhaps with the help of an intrusion 
detection system. To repair from the intrusion, the system 
administrator must first track down the initial intrusion 
point, such as the adversary’s network connection, or 
a user accidentally running a malware binary. RETRO 
provides a tool similar to BackTracker [21] that helps 
the administrator find the intrusion point, starting from 
the observed symptoms, by leveraging RETRO’s action 
history graph. In the rest of this paper, we assume that an 
intrusion detection system exists, and we do not describe 
our BackTracker-like tool in any more detail. 


Repair. Once the administrator finds the intrusion point, 
he or she reboots the system, to discard non-persistent 


¹For simplicity, our prototype globally orders all checkpoints and
actions for all objects. 
state, and invokes RETRO’s repair controller, specifying
the name of the intrusion point determined in the previous
step.² The repair controller undoes the offending action,
A, by rolling back objects modified by A to a previous 
checkpoint, and replacing A with a no-op in the action 
history graph. Then, using the action history graph, the 
controller determines which other actions were poten- 
tially influenced by A (e.g., the values of their arguments 
changed), rolls back the objects they depend on (e.g., 
their arguments) to a previous checkpoint, re-executes 
those actions in their corrected environment (e.g., with 
the rolled-back arguments), and then repeats the process 
for actions that the re-executed actions may have influ- 
enced. This process will also undo subsequent actions 
by the adversary, since the action that initially caused 
them, A, has been reverted. Thus, after repair, the system 
will contain the effects of all legitimate actions since the 
compromise, but none of the effects of the attack. 
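The repair loop described above can be sketched on a toy model. The sketch makes several simplifying assumptions that are ours, not RETRO's: objects are whole strings in a dictionary, the log is a single time-ordered list, each action records the outputs it produced originally, and repair is expressed as one forward pass from a checkpoint taken before the attack. All names (`LoggedAction`, `repair`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class LoggedAction:
    name: str
    reads: List[str]
    writes: List[str]
    run: Callable[[Dict[str, str]], Dict[str, str]]   # inputs -> outputs
    recorded_outputs: Dict[str, str]                  # outputs seen in the original run


def repair(checkpoint: Dict[str, str], log: List[LoggedAction], attack_name: str):
    """Replay the log from a pre-attack checkpoint, re-executing only affected actions."""
    state = dict(checkpoint)     # start from the rolled-back state
    changed = set()              # objects whose value now differs from the original run
    reexecuted = []
    for action in log:
        if action.name == attack_name:
            # The attack becomes a no-op: its writes never happen, so the
            # objects it wrote now differ from what later actions saw.
            changed.update(action.writes)
            continue
        if changed.intersection(action.reads):
            # Some input differs from the original run: re-execute.
            inputs = {obj: state.get(obj, "") for obj in action.reads}
            outputs = action.run(inputs)
            reexecuted.append(action.name)
        else:
            # Inputs are unchanged, so the recorded outputs are still valid.
            outputs = action.recorded_outputs
        for obj in action.writes:
            state[obj] = outputs[obj]
            if outputs[obj] == action.recorded_outputs[obj]:
                changed.discard(obj)   # same value as before: stop propagating
            else:
                changed.add(obj)       # propagate to later readers
    return state, reexecuted
```

In this model, an action that read only unaffected objects is never re-run, and an object stops propagating re-execution as soon as some action writes the same value it wrote originally.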

To minimize re-execution and to avoid potential con- 
flicts, the repair controller checks whether the inputs to 
each action are semantically equivalent to the inputs dur- 
ing original execution, and skips re-execution in that case. 
In our running example, if Alice’s sshd process reads a 
password file that Eve modified, it might not be necessary 
to re-execute sshd if its execution only depended on Al-
ice’s password entry, and Eve did not change that entry. If
Alice’s sshd later changed her password entry, then this 
change will not result in a conflict during repair because 
the repair controller will determine that her change to the 
password file could not have been influenced by Eve. 
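The sshd example can be phrased as a predicate over the password file. The sketch below is illustrative only: it assumes the usual colon-separated passwd layout with the user name in the first field, and the function names are ours rather than part of RETRO's API.

```python
def passwd_entry(passwd_text: str, user: str):
    """Return the passwd line for `user`, or None if there is no such entry."""
    for line in passwd_text.splitlines():
        if line.split(":", 1)[0] == user:
            return line
    return None


def needs_reexecution(original: str, repaired: str, user: str) -> bool:
    """Re-execute a lookup of `user` only if that user's entry differs after repair."""
    return passwd_entry(original, user) != passwd_entry(repaired, user)


original = "root:x:0:0::/root:/bin/sh\nalice:x:1000:1000::/home/alice:/bin/sh\neve:x:0:0::/:/bin/sh\n"
repaired = "root:x:0:0::/root:/bin/sh\nalice:x:1000:1000::/home/alice:/bin/sh\n"

print(needs_reexecution(original, repaired, "alice"))   # False: Alice's entry is unchanged
print(needs_reexecution(original, repaired, "eve"))     # True: Eve's entry was removed
```

Comparing only the entry that was actually consulted, rather than the whole file, is what lets Alice's login avoid re-execution even though the file as a whole changed.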

RETRO’s repair controller must manipulate many kinds 
of objects (e.g., files, directories, processes, etc.) and 
re-execute many types of actions (e.g., system calls and 
function calls) during repair. To ensure that RETRO’s de- 
sign is extensible, RETRO’s action history graph provides 
a well-defined API between the repair controller and in- 
dividual graph objects and actions. Using this API, the 
repair controller implements a generic repair algorithm, 
and interacts with the graph through individual repair 
managers associated with each object and action in the 
action history graph. Each repair manager, in turn, tracks
the state associated with its respective object or action,
implements object/action-specific operations during re- 
pair, and efficiently stores and accesses the on-disk state, 
logs, and checkpoints. 


External dependencies. During repair, RETRO may 
discover that changes made by the adversary were ex- 
ternally visible. RETRO relies on compensating actions to 
deal with external dependencies where possible. For ex- 
ample, if a user’s terminal output changes, RETRO sends 


²Each object and action in the action history graph has a unique
name, as described in §5. 
a diff between the old and new terminal sessions to the
user in question. 
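A compensating action of this kind could be produced with a standard text diff. The following sketch uses Python's `difflib` and assumes that terminal sessions are available as plain text; how RETRO captures sessions and delivers the diff to the user is not described here, and the function name is hypothetical.

```python
import difflib


def terminal_diff(old_session: str, new_session: str) -> str:
    """Unified diff between the original and the repaired terminal output."""
    return "".join(difflib.unified_diff(
        old_session.splitlines(keepends=True),
        new_session.splitlines(keepends=True),
        fromfile="original session",
        tofile="repaired session",
    ))


old = "$ ls\nbot.sh\nnotes.txt\n"
new = "$ ls\nnotes.txt\n"
print(terminal_diff(old, new))
```

An empty diff means that the user's terminal output is unaffected by the repair, in which case nothing needs to be sent.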

In some cases, RETRO does not have a compensat- 
ing action to apply. If Eve, from our running example, 
connected to her botnet client over the network, RETRO 
would not be able to re-execute the connection during 
repair (the connection will be refused since the botnet 
will no longer be running). When such a situation arises, 
RETRO’s repair controller pauses re-execution and asks 
the administrator to manually re-execute the appropriate 
action. In the case of Eve’s connection, the administra- 
tor can safely do nothing and tell the repair controller to 
resume. 


Assumptions. RETRO makes three significant assump- 
tions. First, RETRO assumes that the system administrator 
detects intrusions in a timely manner, that is, before the 
relevant logs are garbage-collected. An adversary that is 
aware of RETRO could compromise the system and then 
try to avoid detection, by minimizing any activity until 
RETRO garbage-collects the logs from the initial intru- 
sion. If the initial intrusion is not detected in time, the 
administrator will not be able to revert it directly, but this 
strategy would greatly slow down attackers. Moreover, 
the administrator may be able to revert subsequent actions 
by the adversary that leveraged the initial intrusion to 
cause subsequent notable activity. 

Second, RETRO assumes that the administrator 
promptly detects any intrusions with wide-ranging effects 
on the execution of the entire system. If such intrusions 
persist for a long time, RETRO will require re-execution 
of large parts of the system, potentially incurring many 
conflicts and requiring significant user input. However, 
we believe this assumption is often reasonable, since the 
goal of many adversaries is to remain undetected for as 
long as possible (e.g., to send more spam, or to build up a 
large botnet), and making pervasive changes to the system 
increases the risk of detection. 

Third, for this paper, we assume that the adversary com- 
promises a computer system through user-level services. 
The adversary may install new programs, add backdoors 
to existing programs, modify persistent state and con- 
figuration files, and so on, but we assume the adversary 
doesn’t tamper with the kernel, file system, checkpoints, 
or logs. RETRO’s techniques rely on a detailed under- 
standing of operating system objects, and our assumptions 
allow RETRO to trust the kernel state of these objects. We 
rely on existing techniques for hardening the kernel, such 
as [16, 28, 39, 41], to achieve this goal in practice. 


4 ACTION HISTORY GRAPH 


RETRO’s design centers around the action history graph, 
which represents the execution of the entire system over 


Figure 2: A simplified view of the action history graph depicting Eve’s
attack in our running example. In this graph, attacker Eve adds an 
account for herself to /etc/passwd, after which root adds an account 
for Alice, and Alice logs in via ssh. As an example, we consider Eve’s 
write to the password file to be the attack action, although in reality, 
the attack action would likely be the network connection that spawned 
Eve’s process in the first place. Not shown are intermediate data objects, 
and system call actors, described in §4.3 and Figure 4. 


time. The action history graph must address four require- 
ments in order to disentangle attacker actions from le- 
gitimate operations. First, it must operate system-wide, 
capturing all dependencies and actions, to ensure that 
RETRO can detect and repair all effects of an intrusion. 
Second, the graph must support fine-grained re-execution 
of just the actions affected by the intrusion, without hav- 
ing to re-execute unaffected actions. Third, the graph 
must be able to disambiguate attack actions from legiti- 
mate operations whenever possible, without introducing 
false dependencies. Finally, recording and accessing the 
action history graph must be efficient, to reduce both run- 
time overheads and repair time. The rest of this section 
describes the design of RETRO’s action history graph. 


4.1 Repair using the action history graph 


RETRO represents an attack as a set of attack actions. For 
example, an attack action can be a process reading data 
from the attacker’s TCP connection, a user inadvertently 
running malware, or an offending file write. Given a set 
of attack actions, RETRO repairs the system in two steps, 
as follows. 

First, RETRO replaces the attack actions with benign 
actions in the action history graph. For example, if the 
attack action was a process reading a malicious request 
from the attacker’s TCP connection, RETRO removes the 
request data, as if the attacker never sent any data on that 
connection. If the attack action was a user accidentally 
running malware, RETRO changes the user’s exec system 
call to run /bin/true instead of the malware binary. 
Finally, if the attack action was an unwanted write to a 
Function or variable                    Semantics

set(checkpt) object.checkpts            Set of available checkpoints for this object.
void object.rollback(c)                 Roll back this object to checkpoint c.
set(action) actor_object.actions        Set of actions that comprise this actor object.
set(action) data_object.readers         Set of actions that have a dependency from this data object.
set(action) data_object.writers         Set of actions that have a dependency to this data object.
set(data_object) data_object.parts      Set of data objects whose state is part of this data object.

actor_object action.actor               Actor containing this action.
set(data_object) action.inputs          Set of data objects that this action depends on.
set(data_object) action.outputs         Set of data objects that depend on this action.
bool action.equiv()                     Check whether any inputs of this action have changed.
bool action.connect()                   Add dependencies for new inputs and outputs, based on new inputs.
void action.redo()                      Re-execute this action, updating output objects.

Figure 3: Object (top) and action (bottom) repair manager API.


file, as in Figure 2, RETRO replaces the action with a zero- 
byte write. RETRO includes a handful of such benign 
actions used to neutralize intrusion points found by the 
administrator. 

Second, RETRO repairs the system state to reflect the 
above changes, by iteratively re-executing affected ac- 
tions, starting with the benign replacements of the at- 
tack actions themselves. Prior to re-executing an action, 
RETRO must roll back all input and output objects of that 
action, as well as the actor itself, to an earlier checkpoint. 
For example, in Figure 2, RETRO rolls back the output of 
the attack action—namely, the password file object—to 
its earlier checkpoint. 

RETRO then considers all actions with dependencies to 
or from the objects in question, according to their time 
order. Actions with dependencies to the object in question
are re-executed, to reconstruct the object. For actions 
with dependencies from the object in question, RETRO 
checks whether their inputs are semantically equivalent 
to their inputs during original execution. If the inputs 
are different, such as the useradd command reading the 
modified password file in Figure 2, the action will be 
re-executed, following the same process as above. On 
the other hand, if the inputs are semantically equivalent, 
RETRO skips re-execution, avoiding the repair cascade. 
For example, re-executing sshd may be unnecessary, if 
the password file entry accessed by sshd is the same 
before and after repair. We will describe shortly how 
RETRO determines this (in §4.4 and Figure 5).


4.2 Graph API 


As described above, repairing the system requires three 
functions: rolling back objects to a checkpoint, re- 
executing actions, and checking an action’s input depen- 
dencies for semantic equivalence. To support different 
types of objects and actions in a system-wide action his- 
tory graph, RETRO delegates these tasks, as well as track- 
ing the graph structure itself, to repair managers associ- 
ated with each object and action in the graph. 
A manager consists of two halves: a runtime half, re-
sponsible for recording logs and checkpoints during nor- 
mal execution, and a repair-time half, responsible for 
repairing the system state once the system administrator 
invokes RETRO to repair an intrusion. The runtime half 
has no pre-defined API, and needs only to synchronize
its log and checkpoint format with the repair-time half. 
On the other hand, the repair-time half has a well-defined 
API, shown in Figure 3. 


Object manager. During normal execution, object 
managers are responsible for making periodic checkpoints 
of objects. For example, the file system manager takes 
snapshots of files, such as a copy of /etc/passwd in Fig- 
ure 2. Process objects also have checkpoints in the graph, 
although in our prototype, the only supported process 
checkpoint is the initial state of a process immediately 
prior to exec. 

During repair, an object manager is responsible for 
maintaining the state represented by its object. For per- 
sistent objects, the manager uses the on-disk state, such 
as the actual file for a file object. For ephemeral objects, 
such as processes or pipes, the manager keeps a temporary 
in-memory representation to help action managers redo 
actions and check predicates, as we describe in §5. 

An object manager provides one main procedure in- 
voked during repair, o.rollback(v), which rolls back ob- 
ject o’s state to checkpoint v. For a file object, this means 
restoring the on-disk file from snapshot v. For a pro- 
cess, this means constructing an initial, paused process in 
preparation for redoing exec, as we will discuss in §5.2.3; 
since there is only one kind of process checkpoint, v is 
not used. If the object was last checkpointed long ago, 
RETRO will need to re-execute all subsequent actions that 
modified the data object, or that comprise the actor object. 


Action manager. During normal execution, action man- 
agers are responsible for recording all actions executed 
by actors in the system. For each action, the manager 
records enough information to re-execute the same action 
at repair time, as well as to check whether the inputs are 
semantically equivalent (e.g., by recording the data read
from a file). 

At repair time, an action manager provides three proce- 
dures. First, a.redo() re-executes action a, reading new 
data from a’s input objects and modifying the state of 
a’s output objects. For example, redoing a file write ac- 
tion modifies the corresponding file in the file system; if 
the action was not otherwise modified, this would write 
the same data to the same offset as during original ex- 
ecution. Second, a.equiv() checks whether a’s inputs 
have semantically changed since the original execution. 
For instance, equiv on a file read action checks whether 
the file contains the same data at the same offset (and, 
therefore, whether the read call would return the same 
data). Finally, a.connect() updates action a’s input and 
output dependencies, in case changed inputs result in
the action reading or modifying new objects. To ensure 
that past dependencies are not lost, connect only adds, 
and never removes, dependencies (even if the action in 
question does not use that dependency). 
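
To make the repair-time API of Figure 3 concrete, the following is a minimal Python sketch of an object manager and two action managers for file reads and writes. It is an illustration under stated assumptions, not RETRO's implementation: the class names (FileObject, FileWriteAction, FileReadAction), the in-memory byte string standing in for an on-disk file, and the trivial equiv and connect on writes are all hypothetical simplifications.

```python
from dataclasses import dataclass, field


@dataclass
class FileObject:
    """Data object: an in-memory stand-in for an on-disk file (assumption)."""
    data: bytes
    checkpts: dict = field(default_factory=dict)  # checkpoint id -> snapshot

    def checkpoint(self, c):
        self.checkpts[c] = self.data

    def rollback(self, c):
        # Restore the file contents from snapshot c.
        self.data = self.checkpts[c]


@dataclass
class FileWriteAction:
    """Writes `payload` at `offset` of `target`, as logged at run time."""
    target: FileObject
    offset: int
    payload: bytes

    def redo(self):
        # Same data, same offset as the original execution.
        d = self.target.data.ljust(self.offset, b"\0")
        end = self.offset + len(self.payload)
        self.target.data = d[:self.offset] + self.payload + d[end:]

    def equiv(self):
        return True  # simplification: a write has no data inputs here

    def connect(self):
        return True  # simplification: no new dependencies can appear


@dataclass
class FileReadAction:
    """Read of `length` bytes at `offset`; `seen` is the data logged originally."""
    source: FileObject
    offset: int
    length: int
    seen: bytes

    def current(self):
        return self.source.data[self.offset:self.offset + self.length]

    def equiv(self):
        # Equivalent if the read call would return the same bytes as before.
        return self.current() == self.seen

    def redo(self):
        self.seen = self.current()

    def connect(self):
        return True


# Usage: a read of the first entry is unaffected by a later append.
passwd = FileObject(b"root:x:0\n")
passwd.checkpoint(0)
first_entry = FileReadAction(passwd, 0, 9, passwd.data[0:9])
FileWriteAction(passwd, 9, b"eve:x:0\n").redo()
passwd.rollback(0)
unchanged = first_entry.equiv()
```

In this model a read that covered only the first entry remains equivalent after the appended entry is rolled back, which is the property that lets the repair controller skip re-execution.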


4.3 Refining actor objects: 
Finer-grained re-execution 


An important goal of RETRO’s design is minimizing re- 
execution, so as to avoid the need for user input to handle 
potential conflicts and external dependencies. It is of- 
ten necessary to re-execute a subset of an actor’s actions, 
but not necessarily the entire actor. For example, after 
rolling back a file like /etc/passwd to a checkpoint that 
was taken long ago, RETRO needs to replay all writes 
to that file, but should not need to re-execute the pro- 
cesses that issued those writes. Similarly, in Figure 2, 
RETRO would ideally re-execute only a part of sshd that 
checks whether Alice’s password entry is the same, and 
if so, avoid re-executing the rest of sshd, which would 
lead to an external dependency because cryptographic 
keys would need to be re-negotiated. Unfortunately, re- 
executing a process from an intermediate state is difficult 
without process checkpointing. 

To address this challenge, RETRO refines actors in the 
action history graph to explicitly denote parts of a pro- 
cess that can be independently re-executed. For example, 
RETRO models every system call issued by a process by a 
separate system call actor, comprising a single system call 
action, as shown in Figure 4. The system call arguments, 
and the result of the system call, are explicitly represented 
by system call argument and return value objects. This 
allows RETRO to re-execute individual system calls when 
necessary (e.g., to re-construct a file during repair), while 
avoiding re-execution of entire processes if the return 
values of system calls remain the same. 

The same technique is also applied to re-execute spe- 
cific functions instead of an entire process. Figure 5 shows 
a part of the action history graph for our running example, 
Figure 4: An illustration of the system call actor object and arguments
and return value data objects, for Eve’s write to the password file from 
Figure 2. Legend is the same as in Figure 2. 
Figure 5: An illustration of refinement in an action history graph, de-
picting the use of additional actors to represent a re-executable call to 
getpwnam from sshd. Legend is the same as in Figure 2. 


in which sshd creates a separate actor to represent its call 
to getpwnam("alice"). While getpwnam’s execution 
depends on the entire password file, and thus must be 
re-executed if the password file changes, its return value 
contains only Alice’s password entry. If re-execution 
of getpwnam produces the same result, the rest of sshd 
need not be re-executed. §5 describes such higher-level 
managers in more detail. 
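
The effect of giving getpwnam its own actor can be sketched in a few lines of Python. The parser below is a simplified stand-in for the libc function, and the password file contents and the helper name caller_must_rerun are hypothetical; the point is only that the caller's input is the function's return value, not the whole file.

```python
def getpwnam(passwd_text, name):
    """Return the passwd entry for `name` as a tuple of fields, or None."""
    for line in passwd_text.splitlines():
        fields = line.split(":")
        if fields[0] == name:
            return tuple(fields)
    return None


def caller_must_rerun(original_text, repaired_text, name):
    """Re-run the getpwnam actor on the repaired file and compare results.

    The getpwnam actor is re-executed whenever the file changes; the calling
    process (e.g., sshd) is re-executed only if the return value differs.
    """
    return getpwnam(original_text, name) != getpwnam(repaired_text, name)


# Hypothetical file contents before and after Eve's entry is removed.
original = ("root:x:0:0:root:/root:/bin/sh\n"
            "eve:x:0:0:eve:/home/eve:/bin/sh\n"
            "alice:x:1001:1001:Alice:/home/alice:/bin/sh\n")
repaired = ("root:x:0:0:root:/root:/bin/sh\n"
            "alice:x:1001:1001:Alice:/home/alice:/bin/sh\n")

rerun_for_alice = caller_must_rerun(original, repaired, "alice")
rerun_for_eve = caller_must_rerun(original, repaired, "eve")
```

Here the lookup for Alice returns the same entry in both versions of the file, so a process that depended only on that return value would not need to be re-executed, whereas a lookup for Eve would.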


The same mechanism helps RETRO create benign re- 
placements for attack actions. For example, in order 
to undo a user accidentally executing malware, RETRO 
changes the exec system call’s arguments to invoke 
/bin/true instead of the malware binary. To do this, 
RETRO synthesizes a new checkpoint for the object repre- 
senting exec’s arguments, replacing the original malware 
binary path with /bin/true, and rolls back that object to 
the newly-created “checkpoint”, as illustrated in Figure 6 
and §4.5. 
4.4 Refining data objects:
Finer-grained data dependencies 


While OS-level dependencies ensure completeness, they 
can be too coarse-grained, leading to false dependencies, 
such as every process depending on the /tmp directory. 
RETRO’s design addresses this problem by refining the 
same state at different levels of abstraction in the graph 
when necessary. For instance, a directory manager creates 
individual objects for each file name in a directory, and 
helps disambiguate directory lookups and modifications 
by recording dependencies on specific file names. 

The challenge in supporting refinement in the action 
history graph lies in dealing with multiple objects repre- 
senting the same state. For example, the state of a single 
directory entry is a part of both the directory manager’s 
object for that specific file name, as well as the file man- 
ager’s node for that directory’s inode. On one hand, we 
would like to avoid creating dependencies to and from the 
underlying directory inode, to prevent false dependencies. 
On the other hand, if some process does directly read the 
underlying directory inode’s contents, it should depend 
on all of the directory entries in that directory. 

To address this challenge, each object in RETRO keeps 
track of other objects that represent parts of its state. For 
example, the manager of each directory inode keeps track 
of all the directory entry objects for that directory. The ob- 
ject manager exposes this set of parts through the o.parts 
property, as shown in Figure 3. In most cases, the man- 
ager tracks its parts through hierarchical names, as we 
discuss in §5. 

RETRO’s OS manager records all dependencies, even 
if the same dependency is also recorded by a higher-level 
manager. This means that RETRO can determine trust 
in higher-level dependencies at repair time. If the appro- 
priate manager mediated all modifications to the larger 
object (such as a directory inode), and the manager was 
not compromised, RETRO can safely use finer-grained 
objects (such as individual directory entry objects). Oth- 
erwise, RETRO uses coarse-grained but safe OS-level 
dependencies. 
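
The choice between fine-grained and OS-level dependencies can be illustrated with a small Python model of a directory inode and its parts. This is a sketch under assumptions: the class DirectoryInode, its method names, and the manager_trusted flag are hypothetical, and readers are identified by plain strings.

```python
class DirectoryInode:
    """Directory inode object that tracks per-name parts (illustrative)."""

    def __init__(self):
        self.parts = {}             # file name -> readers of that entry
        self.raw_readers = set()    # readers of the raw directory contents
        self.os_readers = set()     # OS-level record: every reader of the inode

    def record_lookup(self, reader, name):
        self.parts.setdefault(name, set()).add(reader)
        self.os_readers.add(reader)  # the OS manager records it as well

    def record_raw_read(self, reader):
        self.raw_readers.add(reader)
        self.os_readers.add(reader)

    def affected_readers(self, changed_name, manager_trusted):
        if not manager_trusted:
            # Coarse-grained but safe: every reader of the inode.
            return set(self.os_readers)
        # Fine-grained: readers of this entry, plus anyone who read the
        # whole directory and therefore depends on all of its entries.
        return self.parts.get(changed_name, set()) | self.raw_readers


tmp = DirectoryInode()
tmp.record_lookup("gcc", "cc1.o")
tmp.record_lookup("bot", "evil.sh")
tmp.record_raw_read("ls")
fine = tmp.affected_readers("evil.sh", manager_trusted=True)
coarse = tmp.affected_readers("evil.sh", manager_trusted=False)
```

With a trusted manager, the unrelated lookup by gcc is not affected by a change to evil.sh; without it, every reader of the directory is.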


4.5 Repair controller 


RETRO uses a repair controller to repair system state with 
the help of object and action managers. Figure 6 sum- 
marizes the pseudo-code for the repair controller. The 
controller, starting from the REPAIR function, creates a 
parallel “repaired” timeline by re-executing actions in the 
order that they were originally executed. To do so, the 
controller maintains a set of objects that it is currently 
repairing (the nodes hash table), along with the last action 
that it performed on that object. REPAIRLOOP continu- 
ously attempts to re-execute the next action, until it has 
considered all actions, at which point the system state is 
fully repaired. 
function ROLLBACK(node, checkpt)
    node.rollback(checkpt)
    state[node] := checkpt

function PREPAREREDO(action)
    if ¬action.connect() then return FALSE
    if state[action.actor] > action then
        cps := action.actor.checkpts
        cp := max(c ∈ cps | c < action)
        ROLLBACK(action.actor, cp)
        return FALSE
    for all o ∈ (action.inputs ∪ action.outputs) do
        if state[o] < action then continue
        ROLLBACK(o, max(c ∈ o.checkpts | c < action))
        return FALSE
    return TRUE

function PICKACTION()
    actions := ∅
    for all o ∈ state | o is actor object do
        actions += min(a ∈ o.actions | a > state[o])
    for all o ∈ state | o is data object do
        actions += min(a ∈ o.readers ∪ o.writers | a > state[o])
    return min(actions)

function REPAIRLOOP()
    while a := PICKACTION() do
        if a.equiv() and state[o] > a, ∀o ∈ a.outputs ∪ a.actor then
            for all i ∈ a.inputs ∩ keys(state) do
                state[i] := a
            continue        ▷ skip semantically-equivalent action
        if PREPAREREDO(a) then
            a.redo()
            for all o ∈ a.inputs ∪ a.outputs ∪ a.actor do
                state[o] := a

function REPAIR(repair_obj, repair_cp)
    ROLLBACK(repair_obj, repair_cp)
    REPAIRLOOP()

Figure 6: The repair algorithm.


To choose the next action for re-execution, REPAIRLOOP
invokes PICKACTION, which chooses the earliest
action that hasn’t been re-executed yet, out of all the ob- 
jects being repaired. If the action’s inputs are the same 
(according to equiv), and none of the outputs of the ac- 
tion need to be reconstructed, REPAIRLOOP does not 
re-execute the action, and just advances the state of the 
action’s input nodes. If the action needs to be re-executed, 
REPAIRLOOP invokes PREPAREREDO, which ensures 
that the action’s actor, input objects, and output objects 
are all in the right state to re-execute the action (by rolling 
back these objects when appropriate). Once PREPARE- 
REDO indicates it is ready, REPAIRLOOP re-executes the 
action and updates the state of the actor, input, and output 
objects. REPAIR is the entry point: it first rolls back
repair_obj to the (newly-synthesized) checkpoint
repair_cp, as described in §4.3, and then invokes
REPAIRLOOP.

Not shown in the pseudo-code is handling of refined 
objects. When the controller rolls back an object that has 
a non-empty set of parts, it must consider re-executing 
actions associated with those parts, in addition to actions 
associated with the larger object. Also not shown is the 
checking of integrity for higher-level dependencies, as 
described in §4.4. 
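
The control flow of Figure 6 can be exercised with a small runnable Python model. The sketch below makes several simplifying assumptions that are not part of RETRO: time is an integer, every data object has a single checkpoint at time 0, actor objects and connect are omitted (each action behaves like its own single-action actor), and an object absent from state is treated as being in its current state, later than every action. The class names and the toy password-file scenario are hypothetical.

```python
INF = float("inf")  # "current" state: later than every recorded action


class DataObject:
    def __init__(self, name, value):
        self.name, self.value = name, value
        self.checkpts = {0: value}       # checkpoint time -> saved value
        self.readers, self.writers = [], []

    def rollback(self, c):
        self.value = self.checkpts[c]


class Action:
    def __init__(self, time, fn, inputs, outputs):
        self.time, self.fn = time, fn
        self.inputs, self.outputs = inputs, outputs
        for o in inputs:
            o.readers.append(self)
        for o in outputs:
            o.writers.append(self)
        self.redo()                      # original execution; logs the inputs

    def equiv(self):
        return [o.value for o in self.inputs] == self.seen

    def redo(self):
        self.seen = [o.value for o in self.inputs]
        for o, v in zip(self.outputs, self.fn(*self.seen)):
            o.value = v


def repair(repair_obj, repair_cp):
    state, redone = {}, []

    def rollback(node, c):
        node.rollback(c)
        state[node] = c

    def pick_action():
        picks = []
        for o, t in state.items():
            later = [a for a in o.readers + o.writers if a.time > t]
            if later:
                picks.append(min(later, key=lambda a: a.time))
        return min(picks, key=lambda a: a.time, default=None)

    def prepare_redo(a):
        for o in a.inputs + a.outputs:
            if state.get(o, INF) < a.time:
                continue
            rollback(o, max(c for c in o.checkpts if c < a.time))
            return False
        return True

    rollback(repair_obj, repair_cp)
    while (a := pick_action()) is not None:
        if a.equiv() and all(state.get(o, INF) > a.time for o in a.outputs):
            for i in a.inputs:
                if i in state:
                    state[i] = a.time
            continue                     # skip semantically-equivalent action
        if prepare_redo(a):
            a.redo()
            redone.append(a.time)
            for o in a.inputs + a.outputs:
                state[o] = a.time
    return redone


# Toy version of the running example: Eve's write, useradd, getpwnam, login.
passwd = DataObject("passwd", "root")
result = DataObject("getpwnam-result", None)
session = DataObject("session", None)
attack = Action(1, lambda p: (p + ",eve",), [passwd], [passwd])
Action(2, lambda p: (p + ",alice",), [passwd], [passwd])
Action(3, lambda p: ("alice" in p.split(","),), [passwd], [result])
Action(4, lambda r: ("login" if r else "denied",), [result], [session])

attack.fn = lambda p: (p,)               # benign replacement: write nothing
redone = repair(passwd, 0)
```

In this toy run the two writes and the lookup are re-executed, while the final login action is skipped because its only input, the lookup result, is unchanged after repair.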


5 OBJECT AND ACTION MANAGERS 


This section describes RETRO’s object and action man- 
agers, starting with the file system and OS managers that 
guarantee completeness of the graph, and followed by 
higher-level managers that provide finer-grained depen- 
dencies for application-specific parts of the graph. 


5.1 File system manager 


The file system manager is responsible for all file objects. 
To uniquely identify files, the manager names file objects 
by (device, part, inode). The device and part components 
identify the disk and partition holding the file system. 
Our current prototype disallows direct access to partition 
block devices, so that file system dependencies are always 
trusted. The inode number identifies a specific file by in- 
ode, without regard to path name. To ensure that files can 
be uniquely identified by inode number, the file system 
manager prevents inode reuse until all checkpoints and 
logs referring to the inode have been garbage-collected. 

During normal operation, the file system manager must 
periodically checkpoint its objects (including files and 
directories), using any checkpointing strategy. Our im- 
plementation relies on a snapshotting file system to make 
periodic snapshots of the entire file system tree (e.g., once 
per day). This works well for systems which already cre- 
ate daily snapshots [26, 32, 37, 38], where the file system 
manager can simply leverage existing snapshots. Upon 
file deletion, the file system manager moves the deleted 
inode into a special directory, so that it can reuse the same 
exact inode number on rollback. The manager preserves 
the inode’s data contents, so that RETRO can undo an 
unlink operation by simply linking the inode back into a 
directory (see §5.3). 

During repair, the file system manager’s rollback 
method uses a special kernel module to open the check- 
pointed file as well as the current file by their inode num- 
ber. Once the repair manager obtains a file descriptor for
both inodes, it overwrites the current file’s contents with 
the checkpoint’s contents, or re-constructs an identical set 
of directory entries, for directory inodes. On rollback to a 
file system snapshot where the inode in question was not 
allocated yet, the file system manager truncates the file to 
zero bytes, as if it was freshly created. As a precaution, 


the file system manager creates a new file system snapshot 
before initiating any rollback. 


5.2 OS manager


The OS manager is responsible for process and system 
call actors, and their actions. The manager names each 
process in the graph by (bootgen, pid, pidgen, execgen). 
bootgen is a boot-up generation number to distinguish 
process IDs across reboots. pid is the Unix process 
ID, and pidgen is a generation number for the pro- 
cess ID, used to distinguish recycled process IDs. Fi- 
nally, execgen counts the number of times a process 
called the exec system call; the OS manager logically 
treats exec as creating a new process, albeit with the 
same process ID. The manager names system calls by 
(bootgen, pid, pidgen, execgen, sysid), where sysid is a 
per-process unique ID for that system call invocation. 
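
The naming scheme can be written down directly as tuples. The following Python sketch uses the field names from the text; the type names ProcessName and SyscallName and the helper functions are hypothetical.

```python
from typing import NamedTuple


class ProcessName(NamedTuple):
    bootgen: int   # boot-up generation number
    pid: int       # Unix process ID
    pidgen: int    # generation of this pid, distinguishes recycled pids
    execgen: int   # number of exec calls made so far by this process


class SyscallName(NamedTuple):
    bootgen: int
    pid: int
    pidgen: int
    execgen: int
    sysid: int     # per-process unique ID of the system call invocation


def after_exec(p: ProcessName) -> ProcessName:
    """exec is treated as creating a new process with the same pid."""
    return p._replace(execgen=p.execgen + 1)


def syscall_of(p: ProcessName, sysid: int) -> SyscallName:
    """Name a system call issued by process p."""
    return SyscallName(*p, sysid)


shell = ProcessName(bootgen=3, pid=1234, pidgen=0, execgen=0)
program = after_exec(shell)
call = syscall_of(program, 7)
```

Because execgen is part of the name, the process image before and after exec are distinct graph objects even though they share a process ID.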


5.2.1 Recording normal execution 


During normal execution, the OS manager intercepts 
and records all system calls that create dependencies to 
or from other objects (i.e., not getpid, etc.), recording
enough information about the system calls to both re- 
execute them at repair time, and to check whether the 
inputs to the system call are semantically equivalent. The 
OS manager creates nominal checkpoints of process and 
system call actors. Since checkpointing of processes mid- 
execution is difficult [13, 35], our OS manager check- 
points actors only in their “initial” state immediately prior 
to exec, denoted by ⊥. The OS manager also keeps
track of objects representing ephemeral state, including 
pipes and special devices such as /dev/null. Although 
RETRO does not attempt to repair this state, having these 
objects in the graph helps track and check dependen- 
cies using equiv during repair, and to perform partial 
re-execution. 


5.2.2 Action history graph representation 


In the action history graph, the OS manager represents 
each system call by two actions in the process actor, two 
intermediate data objects, and a system call actor and ac- 
tion, as shown in Figure 4. The first process action, called 
the syscall invocation action, represents the execution of 
the process up until it invokes the system call. This action 
conceptually places the system call arguments, and any 
other relevant state, into the system call arguments object. 
For example, the arguments for a file write include the 
target inode, the offset, and the data. The arguments for 
exec, on the other hand, include additional information 
that allows re-executing the system call actor without hav- 
ing to re-execute the process actor, such as the current 
working directory, file descriptors not marked O_CLOEXEC, 
and so on. 
The system call action, in a separate actor, conceptually
reads the arguments from this object, performs the system 
call (incurring dependencies to corresponding objects), 
and writes the return value and any returned data into 
the return value object. For example, a write system 
call action, shown in Figure 4, creates a dependency to 
the modified file, and stores the number of bytes written 
into the return value object. Finally, the second process 
action, called the syscall return action, reads the returned 
data from that object, and resumes process execution. In 
case of fork or exec, the OS manager creates two return 
objects and two syscall return actions, representing return 
values to both the old and new process actors. Thus, every 
process actor starts with a syscall return action, with a 
dependency from the return object for fork or exec. 

In addition to system calls, Unix processes interact 
with memory-mapped files. RETRO cannot re-execute 
memory-mapped file accesses without re-executing the 
process. Thus, the OS manager associates dependencies 
to and from memory-mapped files with the process’s own 
actions, as opposed to actions in a system call actor. In par- 
ticular, every process action (either syscall invocation or 
return) has a dependency from every file memory-mapped 
by the process at that time, and a dependency to every file 
memory-mapped as writable at that time. 


5.2.3 Shepherded re-execution 


During repair, the OS manager must re-execute two types 
of actors: process actors and system call actors. For sys- 
tem call actors, when the repair controller invokes redo, 
the OS manager reads the (possibly changed) values from 
the system call arguments object, executes the system call 
in question, and places return data into the return object. 
equiv on a system call action checks whether the input 
objects have the same values as during the original ex- 
ecution. Finally, connect reads the (possibly changed) 
inputs, and creates any new dependencies that result. For 
example, if a stat system call could not find the named 
file during original execution, but RETRO restores the file 
during repair, connect would create a new dependency 
from the newly-restored file. 

For process actors, the OS manager represents the 
state of a process during repair with an actual process 
being shepherded via the ptrace debug interface. On 
p.rollback(⊥), the OS manager creates a fresh process
for process object p under ptrace. When the repair 
controller invokes redo on a syscall return action, the 
OS manager reads the return data from the correspond- 
ing system call return object, updates the process state 
using PTRACE_POKEDATA and PTRACE_SETREGS, and al- 
lows the process to execute until it’s about to invoke the 
next system call. equiv on a system call return action 
checks if the data in the system call return object is the 
same as during the original execution. When the repair 
controller invokes redo on the subsequent syscall invo-
cation action, the OS manager simply marshals the argu- 
ments for the system call invocation into the correspond- 
ing system call arguments object. This allows the repair 
controller to separately schedule the re-execution of the 
system call, or to re-use previously recorded return data. 
Finally, connect does nothing for process actions. 

One challenge for the OS manager is to deal with pro- 
cesses that issue different system calls during re-execution. 
The challenge lies in matching up system calls recorded 
during original execution with system calls actually is- 
sued by the process during re-execution. The OS manager 
employs greedy heuristics to match up the two system 
call streams. If a new syscall does not match a previously- 
recorded syscall in order, the OS manager creates new 
system call actions, actors, and objects (as shown in Fig- 
ure 4). Similarly, if a previously-recorded syscall does not 
match the re-executed system calls in order, the OS man- 
ager replaces the previously-recorded syscall’s actions 
with no-ops. In the worst case, the only matches will be 
the initial return from fork or exec, and the final syscall 
invocation that terminates the process, potentially leading 
to more re-execution, but not a loss of correctness. 
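
The text does not spell out the greedy heuristics, so the following Python sketch shows only one plausible in-order matching strategy, stated as an assumption: each re-executed call is matched to the earliest unconsumed recorded call that compares equal. The function name match_syscalls and the string encoding of system calls are hypothetical.

```python
def match_syscalls(recorded, reexecuted):
    """Greedily align two system call streams, preserving order.

    Returns (matches, new, noops): index pairs (recorded, reexecuted) that
    line up, indices of re-executed calls with no recorded counterpart, and
    indices of recorded calls that were not issued again.
    """
    matches, new = [], []
    nxt = 0  # first recorded call that has not been consumed yet
    for j, call in enumerate(reexecuted):
        try:
            i = recorded.index(call, nxt)
        except ValueError:
            new.append(j)        # needs fresh actions, actors, and objects
            continue
        matches.append((i, j))
        nxt = i + 1
    matched = {i for i, _ in matches}
    noops = [i for i in range(len(recorded)) if i not in matched]
    return matches, new, noops


# A script that originally also started an extra program.
recorded = ["exec tex", "exec bot", "exec pdftex", "exit"]
reexecuted = ["exec tex", "exec pdftex", "exit"]
matches, new, noops = match_syscalls(recorded, reexecuted)
```

A greedy match of this kind can skip far ahead in the recorded stream on an early coincidental match, which costs extra re-execution but, as noted above, not correctness.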

In our running example, Eve trojans the texi2pdf 
shell script by adding an extra line to start her botnet 
worker. After repairing the texi2pdf file, RETRO re- 
executes every process that ran the trojaned texi2pdf. 
During shepherded re-execution of texi2pdf, exec sys- 
tem calls to legitimate TeX programs are identical to
those during the original execution; in other words, the 
system call argument objects are equivalent, and equiv on 
the system call action returns true. As a result, there is no 
need to re-execute these child processes. However, exec 
system calls to Eve’s bot are missing, so the manager 
replaces them with no-ops, which recursively undoes any 
changes made by Eve’s bot. 


5.3 Directory manager


The directory manager is responsible for exposing finer- 
grained dependency information about directory entries. 
Although the file system manager tracks changes to di- 
rectories, it treats the entire directory as one inode, caus- 
ing false dependencies in shared directories like /tmp. 
The directory manager names each directory entry by 
(device, part, inode, name). The first three components 
of the name are the file system manager’s name for the 
directory inode. The name part represents the file name 
of the directory entry. 

During normal operation, the directory manager must 
record checkpoints of its objects, conceptually consist- 
ing of the inode number for the directory entry (or ⊥ to
represent non-existent directory entries). However, since 
the file system manager already records checkpoints of 
all directories, the directory manager relies on the file 




system manager’s checkpoints, and does not perform any 
checkpointing of its own. The directory manager simi- 
larly relies on the OS manager to record dependencies 
between system call actions and directory entries accessed 
by those system calls, such as name lookups in namei 
(which incur a dependency from every directory entry 
traversed), or directory modifications by rename (which 
incur a dependency to the modified directory entries).

During repair, the directory manager’s sole responsibil-
ity is rolling back directory entries to a checkpoint; the 
OS manager handles redo of all system calls. To roll back 
a directory entry to an earlier checkpoint, the directory 
manager finds the inode number contained in that direc- 
tory entry (using the file system manager’s checkpoint), 
and changes the directory entry in question to point to 
that inode, with the help of RETRO’s kernel module. If 
the directory entry did not exist in the checkpoint, the 
directory manager similarly unlinks the directory entry. 
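
The rollback step described above can be sketched as follows. This is only an illustration: the dictionary-based directory model, the example device name, and the function name rollback_entry are assumptions made for this sketch, whereas RETRO itself performs the change through its kernel module.

```python
from typing import Dict, Optional, Tuple

# Hypothetical name for a directory entry: (device, part, inode, name).
EntryName = Tuple[str, int, int, str]

# A directory checkpoint maps file names to inode numbers. A missing key
# plays the role of the "non-existent entry" checkpoint value.
DirSnapshot = Dict[str, int]


def rollback_entry(live_dir: DirSnapshot, checkpoint: DirSnapshot,
                   entry: EntryName) -> Optional[int]:
    """Restore one directory entry to its checkpointed state.

    Returns the inode the entry points to afterwards, or None if the
    entry was unlinked because it did not exist in the checkpoint.
    """
    name = entry[3]
    old_inode = checkpoint.get(name)
    if old_inode is None:
        # The entry did not exist at checkpoint time: unlink it.
        live_dir.pop(name, None)
        return None
    # Point the entry back at the checkpointed inode.
    live_dir[name] = old_inode
    return old_inode


# At the checkpoint, the directory held only "build.log". The attacker
# later replaced "build.log" and added "bot".
checkpoint = {"build.log": 1001}
live = {"build.log": 2002, "bot": 2003}
rollback_entry(live, checkpoint, ("sda", 1, 42, "bot"))
rollback_entry(live, checkpoint, ("sda", 1, 42, "build.log"))
print(live)
```

Each entry is rolled back independently, which is what lets unrelated files in a shared directory such as /tmp avoid a false dependency on one another.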


5.4 System library managers 


Every user login on a typical Unix system accesses sev- 
eral system-wide files. For example, each login attempt 
accesses the entire password file, and successful logins 
update both the utmp file (tracking currently logged in 
users) and the lastlog file (tracking each user’s last 
login). In a naive system, these shared files can lead to 
false dependencies, making it difficult to disambiguate 
attacker actions from legitimate changes. To address this 
problem, RETRO uses a libc system library manager to 
expose the semantic independence between these actions. 

One strawman approach would be to represent such 
shared files much as directories (i.e., creating a separate 
object for each user’s password file entry). However, un- 
like the directory manager, which mediates all accesses to 
a directory, a manager for a function in libc cannot guarantee
that an attacker will not bypass it, because the manager,
libc, and the attacker can be in the same address space.
Thus, the libc manager does not change the representation
of data objects, and instead simplifies re-execution,
by creating actors to represent the execution of individual 
libc functions. For example, Figure 5 shows an actor for 
the getpwnam function call as part of sshd. 

During normal operation, the library manager cre- 
ates a fresh actor for each function call to one of the 
managed functions, such as getpwnam, getspnam, and 
getgrouplist. The library manager names function 
call actors by (bootgen, pid, pidgen, execgen, callgen); 
the first four parts name the process, and callgen is a 
unique ID for each function call. Much as with system 
call actors, the arguments object contains the function 
name and arguments, and the return object contains the 
return value. Like processes, function call actors have 
only one checkpoint, ⊥, representing their initial state
prior to the call. 


The library manager requires the OS manager’s help to 
associate system calls issued from inside library functions 
with the function call actor, instead of the process actor. 
To do this, the OS manager maintains a “call stack” of 
function call actors that are currently executing. On every 
function call, the library manager pushes the new function 
call actor onto the call stack, and on return, it pops the 
call stack. The OS manager associates syscall invocation 
and return actions with the last actor on the call stack, if 
any, instead of the process actor. 
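
The call stack bookkeeping described above can be sketched as follows. The class and method names are assumptions made for this sketch; only the naming tuple (bootgen, pid, pidgen, execgen, callgen) and the push/pop behavior come from the text.

```python
from typing import List, Tuple

# A process actor is named (bootgen, pid, pidgen, execgen); a function call
# actor appends a callgen that is unique per call.
ActorName = Tuple[int, ...]


class CallStackTracker:
    """Toy model of the per-process call stack kept by the OS manager."""

    def __init__(self, process_actor: ActorName) -> None:
        self.process_actor = process_actor
        self.stack: List[ActorName] = []
        self._next_callgen = 0

    def enter(self) -> ActorName:
        """On entry to a managed libc function: push a fresh actor."""
        actor = self.process_actor + (self._next_callgen,)
        self._next_callgen += 1
        self.stack.append(actor)
        return actor

    def leave(self) -> ActorName:
        """On return from the function: pop its actor."""
        return self.stack.pop()

    def actor_for_syscall(self) -> ActorName:
        """System calls belong to the innermost function call actor,
        or to the process actor when the stack is empty."""
        return self.stack[-1] if self.stack else self.process_actor


tracker = CallStackTracker((1, 4242, 0, 0))
before = tracker.actor_for_syscall()   # attributed to the process
tracker.enter()                        # e.g., entering getpwnam
inside = tracker.actor_for_syscall()   # attributed to the function call
tracker.leave()
after = tracker.actor_for_syscall()    # attributed to the process again
```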

During repair, the library manager’s rollback and redo 
methods allow the repair controller to re-execute individ- 
ual functions. For example, in Figure 5, the controller 
will re-execute getpwnam, because its dependency on 
/etc/passwd changed due to repair. However, if equiv 
indicates the return value from getpwnam did not change, 
the controller need not re-execute the rest of sshd. 
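
A minimal sketch of this decision follows, assuming a toy lookup function in place of getpwnam and plain equality in place of the manager's equiv check. Both the helper names and the passwd contents are hypothetical.

```python
from typing import Callable, Optional


def lookup_shell(passwd_text: str, user: str) -> Optional[str]:
    """Toy stand-in for getpwnam: return the user's login shell, if any."""
    for line in passwd_text.splitlines():
        fields = line.split(":")
        if fields[0] == user:
            return fields[-1]
    return None


def caller_needs_reexecution(recorded_ret: Optional[str],
                             func: Callable[[str, str], Optional[str]],
                             repaired_input: str, arg: str) -> bool:
    """Re-run only the function on the repaired input. The caller must be
    re-executed only if the new return value differs from the recorded one."""
    new_ret = func(repaired_input, arg)
    return new_ret != recorded_ret


original_passwd = (
    "root:x:0:0::/root:/bin/sh\n"
    "eve:x:1000:1000::/home/eve:/bin/sh\n"
    "alice:x:1001:1001::/home/alice:/bin/bash"
)
# After repair, the attacker's entry is gone.
repaired_passwd = (
    "root:x:0:0::/root:/bin/sh\n"
    "alice:x:1001:1001::/home/alice:/bin/bash"
)

recorded_alice = lookup_shell(original_passwd, "alice")
recorded_eve = lookup_shell(original_passwd, "eve")
alice_affected = caller_needs_reexecution(
    recorded_alice, lookup_shell, repaired_passwd, "alice")
eve_affected = caller_needs_reexecution(
    recorded_eve, lookup_shell, repaired_passwd, "eve")
```

In this toy run, Alice's lookup returns the same value after repair, so the rest of her sshd process need not be re-executed, whereas Eve's lookup changes.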

RETRO’s trust assumption about the library manager 
is that the function does not semantically affect the rest 
of the program’s execution other than through its return 
value. If an attacker process compromises its own libc 
manager, this does not pose a problem, because the pro- 
cess already depended on the attacker in other ways, and 
RETRO will repair it. However, if an attacker exploits a 
vulnerability in the function’s input parsing code (such as 
a buffer overflow in getpwnam parsing /etc/passwd), 
it can take control of getpwnam, and influence the ex- 
ecution of the process in ways other than getpwnam’s 
return value. Thus, RETRO trusts libc functions wrapped 
by the library manager to safely parse files and faithfully 
represent their return values. 


5.5 Terminal manager 


Undoing an attacker’s actions during repair can result in
legitimate applications sending different output to a user’s
terminal. For example, if the user ran ls /tmp, the output
may have included temporary files created by the attacker,
or the ls binary may have been trojaned by the attacker to hide
certain files. While RETRO cannot undo what the user
certain files. While RETRO cannot undo what the user 
already saw, the terminal manager helps RETRO generate 
compensating actions. 

The terminal manager is responsible for objects repre- 
senting pseudo-terminal, or pty, devices (/dev/pts/N in 
Linux). During normal operation, the manager records 
the user associated with each pty (with help from sshd), 
and all output sent to the pty. During repair, if the output 
sent to the pty differs from the output recorded during 
normal operation, the terminal manager computes a text 
diff between the two outputs, and emails it to the user. 
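
The comparison step can be sketched with Python's standard difflib module. The function name terminal_diff and the file labels are assumptions made for this sketch, and the emailing step is omitted.

```python
import difflib


def terminal_diff(recorded: str, repaired: str) -> str:
    """Return a unified diff between the output recorded during normal
    operation and the output produced during repair ("" if identical)."""
    lines = difflib.unified_diff(
        recorded.splitlines(),
        repaired.splitlines(),
        fromfile="original output",
        tofile="output after repair",
        lineterm="",
    )
    return "\n".join(lines)


# During normal operation, ls showed a file created by the attacker.
recorded_output = "notes.txt\nbot\n"
repaired_output = "notes.txt\n"
diff = terminal_diff(recorded_output, repaired_output)
print(diff)
```

An empty result means the user saw exactly what the repaired system would have shown, so no compensating action is needed.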


5.6 Network manager 


The network manager is responsible for compensating 
for externally-visible changes. To this end, the network 
manager maintains objects representing the outside world 
(one object for each TCP connection, and one object for 
each IP address/UDP port pair). During normal operation,
the network manager records all traffic, similar to the 
terminal manager. 

During repair, the network manager compares repaired 
outgoing data with the original execution. When the 
network manager detects a change in outgoing traffic, it 
flags an external dependency, and presents the user or 
administrator with three choices. The first choice is to 
ignore the dependency, which is appropriate for network 
connections associated with the adversary (such as Eve’s 
login session in our running example, which will generate 
different network traffic during repair). The second choice 
is to re-send the network traffic, and wait for a response 
from the outside world. This is appropriate for outgoing 
network connections and idempotent protocols, such as 
DNS. Finally, the third choice is to require the user to 
manually resolve the external dependency, such as by 
manually re-playing the traffic for incoming connections. 
This is necessary if, say, the response to an incoming 
SMTP connection has changed, the application did not 
provide its own compensating action, and the user does 
not want to ignore this dependency. 
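
The three choices can be sketched as a small dispatch routine. The enum, the function names, and the send callback are assumptions made for this sketch; in RETRO the choice is made by the user or administrator.

```python
from enum import Enum
from typing import Callable, List


class Choice(Enum):
    IGNORE = 1   # e.g., a connection associated with the adversary
    RESEND = 2   # e.g., an outgoing, idempotent request such as DNS
    MANUAL = 3   # the user replays or resolves the traffic by hand


def has_external_dependency(recorded: bytes, repaired: bytes) -> bool:
    """Flag a dependency when repaired outgoing traffic differs."""
    return recorded != repaired


def resolve(choice: Choice, repaired: bytes,
            send: Callable[[bytes], None]) -> str:
    """Apply the choice made for one flagged connection."""
    if choice is Choice.IGNORE:
        return "ignored"
    if choice is Choice.RESEND:
        send(repaired)
        return "resent"
    return "awaiting manual replay"


sent: List[bytes] = []
flagged = has_external_dependency(b"query old.example", b"query new.example")
outcome = resolve(Choice.RESEND, b"query new.example", sent.append)
```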


6 IMPLEMENTATION 


We implemented a prototype of RETRO for Linux,³ components
of which are summarized in Figure 7. During
normal execution, a kernel module intercepts and records 
all system calls to a log file, implementing the runtime 
half of the OS, file system, directory, terminal, and net- 
work managers. To allow incremental loading of log 
records, RETRO records an index alongside the log file 
that allows efficient lookup of records for a given process 
ID or inode number. The file system manager implements 
checkpoints using subvolume snapshots in btrfs [37]. The 
libc manager logs function calls using a new RETRO sys- 
tem call to add ordered records to the system-wide log. 
The repair controller, and the repair-time half of each 
manager, are implemented as Python modules. 
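
To illustrate why such an index helps, the sketch below keeps an in-memory log with two lookup tables. The record layout and class name are assumptions made for this sketch; the on-disk format of RETRO's log and index is not described here.

```python
from collections import defaultdict
from typing import DefaultDict, List, Optional, Tuple

# Hypothetical log record: (pid, inode or None, action description).
Record = Tuple[int, Optional[int], str]


class LogIndex:
    """Append-only log with indexes by process ID and by inode number."""

    def __init__(self) -> None:
        self.records: List[Record] = []
        self.by_pid: DefaultDict[int, List[int]] = defaultdict(list)
        self.by_inode: DefaultDict[int, List[int]] = defaultdict(list)

    def append(self, pid: int, inode: Optional[int], action: str) -> None:
        pos = len(self.records)
        self.records.append((pid, inode, action))
        self.by_pid[pid].append(pos)
        if inode is not None:
            self.by_inode[inode].append(pos)

    def for_pid(self, pid: int) -> List[Record]:
        """Load only the records of one process, without scanning the log."""
        return [self.records[i] for i in self.by_pid.get(pid, [])]

    def for_inode(self, inode: int) -> List[Record]:
        """Load only the records that touched one inode."""
        return [self.records[i] for i in self.by_inode.get(inode, [])]


log = LogIndex()
log.append(100, 7, "write")
log.append(200, 8, "read")
log.append(100, None, "fork")
log.append(300, 7, "read")
```

With this layout, the work done at repair time scales with the number of records for the affected process or inode rather than with the total log length.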

RETRO implements three optimizations to reduce log- 
ging costs. First, it records SHA-1 hashes of data read 
from files, instead of the actual data. This allows checking 
for equivalence at repair time, but avoids storing the data 
twice. Second, it does not record data read or written 
by white-listed deterministic processes (in our prototype, 
this includes gcc and ld). This means that, if any of the
read or write dependencies to or from these processes are 
suspected during repair, the entire process will have to 
be re-executed, because individual read and write system 
calls cannot be checked for equivalence or re-executed. 
Since all of the dependency relationships are preserved, 
this optimization trades off repair time for recording time, 


³ While our prototype is Linux-specific, we believe that RETRO’s
approach is equally applicable to other operating systems.
Component                             Lines of code
Logging kernel module                 3,300 lines of C
Repair controller, manager modules    5,000 lines of Python
System library managers               700 lines of C
Backtracking GUI tool                 500 lines of Python

Figure 7: Components of our RETRO prototype, and an estimate of
their complexity, in terms of lines of code.











                   Objects repaired      Objects repaired      User
                   with predicates       without predicates    input
Attack             Proc  Func  File      Proc  Func  File
Password change       1     2     4       430    20   274         1
Log cleaning         59     0    40        60     0    40         0
Running example      58    57    75       513    61   300         1
sshd trojan         530    47   303       530    47   303         3











Figure 8: Repair statistics for the two honeypot attacks (top) and two 
synthetic attacks (bottom). The repaired objects are broken down into 
processes, functions (from libc), and files. Intermediate objects such as 
syscall arguments are not shown. The concurrent workload consisted of 
1,261 process, function, and file objects (both actor and data objects), 
and 16,239 system call actions. RETRO was able to fully repair all 
attacks, with no false positives or false negatives. User input indicates the
number of times RETRO asked for user assistance in repair; the nature 
of the conflict is reported in §7. 


but does not compromise completeness. Third, RETRO 
compresses the resulting log files to save space. 
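
The first optimization, recording a hash of the data read rather than the data itself, can be sketched with Python's standard hashlib module. The function names are assumptions made for this sketch.

```python
import hashlib


def record_read(data: bytes) -> str:
    """At logging time, store only the SHA-1 digest of the data read."""
    return hashlib.sha1(data).hexdigest()


def read_is_equivalent(recorded_digest: str, repaired_data: bytes) -> bool:
    """At repair time, a read is unchanged if the data now present
    hashes to the digest recorded during normal execution."""
    return hashlib.sha1(repaired_data).hexdigest() == recorded_digest


digest = record_read(b"alice:x:1001:1001::/home/alice:/bin/bash\n")
unchanged = read_is_equivalent(
    digest, b"alice:x:1001:1001::/home/alice:/bin/bash\n")
changed = not read_is_equivalent(
    digest, b"alice:x:1002:1002::/home/alice:/bin/bash\n")
```

The digest is enough to decide equivalence, but it cannot reproduce the original bytes, which is why the data itself must still be available from the file system checkpoints when a read has to be replayed.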


7 EVALUATION 


This section answers three questions about RETRO, in 
turn. First, what kinds of attacks can RETRO recover 
from, and how much user input does it require? Second, 
are all of RETRO’s mechanisms necessary in practice? 
And finally, what are the performance costs of RETRO, 
both during normal execution and during repair? 


7.1 Recovery from attack 


To evaluate how RETRO recovers from different attacks, 
we used three classes of attack scenarios. First, to make 
sure we can repair real-world attacks, we used attacks 
recorded by a honeypot. Second, to make sure RETRO 
can repair worst-case attacks, we used synthetic attacks 
designed to be particularly challenging for RETRO, in- 
cluding the attack from our running example. For both 
real-world and synthetic attacks, we perform user activity 
described in the running example after the attack takes 
place—namely, root logs in via ssh and adds an account 
for Alice, who then also logs in via ssh to edit and build a 
LaTeX file. Finally, we compare RETRO to Taser, the
state-of-the-art attack recovery system, using attack scenarios
from the Taser paper [17].


Honeypot attacks. To collect real-world attacks, we 
ran a honeypot [1] for three weeks, with a modified sshd 
that accepted any password for login as root. Out of 
many root logins, we chose two attacks that corrupted 
our honeypot’s state in the most interesting ways.⁴ In the
first attack, the attacker changed the root password. In the 
second attack, the attacker downloaded and ran a Linux 


⁴ Most of the attackers simply ran a botnet binary or a port scanner.
                                  Taser
Scenario               Snapshot  NoI  NoIAN  NoIANC   RETRO  User input required
Illegal storage        FP        FP   FN     FN       ✓      None.
Content destruction    FP        ✓    ✓      FN       ✓      None. (Generates terminal diff compensating action.)
Unhappy student        FP        FP   ✓      FN       ✓      None. (Generates terminal diff compensating action.)
Compromised database   FP        FP   FP     FN       ✓      None.
Software installation  FP        FP   ✓      ✓        ✓      Re-execute browser (or ignore browser state changes).
Inexperienced admin    FP        FP   FP     ✓        ✓      Skip re-execution of attacker’s login session.


Figure 9: A comparison of Taser’s four policies and RETRO against a set of scenarios used to evaluate Taser [17]. Taser’s snapshot policy tracks all
dependencies, NoI ignores IPC and signals, NoIAN also ignores file name and attributes, and NoIANC further ignores file content. FP indicates a
false positive (undoing legitimate actions), FN indicates a false negative (missing parts of the attack), and ✓ indicates no false positives or negatives.


binary that scrubbed system log files of any mention of 
the attacker’s login attempt. 

For both of these attacks, RETRO was able to repair 
the system while preserving all legitimate user actions, as 
summarized in Figure 8. In the password change attack, 
root was unable to log in after the attack, immediately 
exposing the compromise, although we still logged in 
as Alice and ran texi2pdf. In the second attack, all 59 
repaired processes were from the attacker’s log cleaning 
program, whose effects were undone. 

For these real-world attacks, RETRO required minimal 
user input. RETRO required one piece of user input to 
repair the password change attack, because root’s login 
attempt truly depended on root’s entry in /etc/passwd, 
which was modified by the attacker. In our experiment, 
the user told the network manager to ignore the conflict. 
RETRO required no user input for the log cleaning attack. 


Synthetic attacks. To check if RETRO can recover 
from more insidious attacks, we constructed two synthetic 
attacks involving trojans; results for both are summarized 
in Figure 8. For the first synthetic attack, we used the 
running example, where the attacker adds an account for 
eve, installs a botnet and a backdoor PHP script, and tro- 
jans the /usr/bin/texi2pdf shell script to restart the 
botnet. Legitimate users were unaware of this attack, and 
performed the same actions. Once the administrator de- 
tected the attack, RETRO reverted Eve’s changes, includ- 
ing the eve account, the bot, and the trojan. As described 
in §5.2.3, RETRO used shepherded re-execution to undo 
the effects of the trojan without re-running the bulk of the 
trojaned application. As Figure 8 indicates, RETRO re- 
executed several functions (getpwnam) to check if remov- 
ing eve’s account affected any subsequent logins. One 
login session was affected—Eve’s login—and RETRO’s 
network manager required user input to confirm that Eve’s 
login need not be re-executed. 

One problem we discovered when repairing the running 
example attack is that the UID chosen for Alice by root’s 
useradd alice command depends on whether eve’s ac- 
count is present. If RETRO simply re-executed useradd 
alice, useradd would pick a different UID during re- 
execution, requiring RETRO to re-execute Alice’s entire 
session. Instead, we made the useradd command part of 


the system library manager, so that during repair, it first 
tries to re-execute the action of adding user alice under 
the original UID, and only if that fails does it re-execute 
the full useradd program. This ensures that Alice’s UID 
remains the same even after RETRO removes the eve 
account (as long as Alice’s UID is still available). 
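
The repair-time behavior for useradd can be sketched as follows. The dictionary model of the password file and the lowest-free-UID fallback are assumptions made for this sketch; the real useradd allocation policy may differ.

```python
from typing import Dict


def readd_user(passwd: Dict[str, int], name: str, original_uid: int,
               first_uid: int = 1000) -> int:
    """Re-add a user during repair, preferring the UID it originally had.

    passwd maps user names to UIDs. Falls back to a toy allocation
    policy (lowest free UID at or above first_uid) if the original UID
    is no longer available.
    """
    in_use = set(passwd.values())
    if original_uid not in in_use:
        # Fast path: reuse the original UID so later actions stay valid.
        passwd[name] = original_uid
        return original_uid
    # Slow path: behave like a full re-execution of useradd.
    uid = first_uid
    while uid in in_use:
        uid += 1
    passwd[name] = uid
    return uid


# Originally eve held UID 1000, so alice was assigned 1001.
# After repair, eve's account is gone.
repaired_passwd = {"root": 0}
alice_uid = readd_user(repaired_passwd, "alice", original_uid=1001)
```

Here Alice keeps UID 1001 even though 1000 is now free, so actions in her session that depended on her UID do not need to be re-executed.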

A second synthetic attack we tried was to trojan 
/usr/sbin/sshd. In this case, users were able to log 
in as usual, but undoing the attack required re-executing 
their login sessions with a good sshd binary. Because 
RETRO cannot rerun the remote ssh clients (and a new key 
exchange, resulting in different keys, makes TCP-level 
replay useless), RETRO’s network manager asks the ad- 
ministrator to redo each ssh session manually. Of course, 
this would not be practical on a real system, and the ad- 
ministrator may instead resort to manually auditing the 
files affected by those login sessions, to verify whether 
they were affected by the attack in any way. However, we 
believe it is valuable for RETRO to identify all connections 
affected by the attack, so as to help the administrator lo- 
cate potentially affected files. In practice, we hope that an 
intrusion detection system can notice such wide-reaching 
attacks; after a few user logins, the dependency graph 
indicates that unrelated user logins are all dependent on a 
previous login session, which an IDS may be able to flag. 


Taser attacks. Finally, we compare RETRO to the state- 
of-the-art intrusion recovery system, Taser, under the 
attack scenarios that were used to originally evaluate 
Taser [17]. Figure 9 summarizes the results. 

In the first scenario, illegal storage, the attacker creates
a new account for herself, stores illegal content on the
system, and trojans the ls binary to mask the illegal
content. RETRO rolls back the account, illegal files, and
the trojaned ls binary, and uses the legitimate ls binary to
re-execute all ls processes from the past. Even though the
trojaned ls binary hid some files, the legitimate ls binary
produces the same output, because RETRO removes the
hidden files during repair. As a result, there is no need
to notify the user. If ls’s output had changed, the terminal
manager would have sent a diff to the affected users.

In the content destruction scenario, an attacker deletes 
a user’s files. Once the user notices the problem, he 
uses RETRO to undo the attack. After recovering the 
                 Without RETRO   With RETRO
Workload         1 core          1 core       2 cores      Log size   Snapshot size   # of objects   # of actions
Kernel build     295 sec         557 sec      351 sec      761 MB     308 MB          87,405         5,698,750
Web server       7260 req/s      3195 req/s   5453 req/s   98 MB      272 KB          508            185,315
HotCRP           20.4 req/s      15.1 req/s   20.0 req/s   81 MB      27 MB           19,969         939,418


Figure 10: Performance and storage costs of RETRO for three workloads: building the Linux kernel, serving files as fast as possible using Apache [2]
for 1 minute, and simulating requests to HotCRP [23] from the 30 minutes before the SOSP 2007 deadline, which averaged 2.1 requests per
second [44] (running as fast as possible, this workload finished in 3-4 minutes). “# of objects” reflects the number of files, directory entries, and
processes; not included are intermediate objects such as system call arguments. “# of actions” reflects the number of system call actions.


files, RETRO generates a terminal output diff for the login 
session during which the user noticed the missing files 
(after repair, the user’s ls command displays those files).


In the unhappy student scenario, a student exploits an 
ftpd bug to change permissions on a professor’s grade
file, then modifies the grade file in another login session, 
and finally a second accomplice user logs in and makes a 
copy of the grade file. In repairing the attack, RETRO rolls 
back the grade file and its permissions, re-executes the 
copy command (which now fails), and uses the terminal 
manager to generate a diff for the attackers’ sessions, 
informing them that their copy command now failed. 


In the compromised database scenario, an attacker 
breaks into a server, modifies some database records (in 
our case we used SQLite), and subsequently a legitimate 
user logs in and runs a script that updates database records 
of its own. RETRO rolls back the database file to a state 
before the attack, and re-executes the database update 
script to preserve subsequent changes, with no user input. 


In the software installation scenario, the administrator 
installs the wrong browser plugin, and only detects this 
problem after running the browser and downloading some 
files. During repair, RETRO rolls back the incorrect plu- 
gin, and attempts to repair the browser using re-execution. 
Since RETRO encounters external dependencies in re- 
executing network applications, it requests the user to 
manually redo any interactions with the browser. In our 
experiment, the user ignored this external dependency, 
because he knew the browser made no changes to local 
state worth preserving. 


In the inexperienced admin scenario, root selects a 
weak password for a user account, and an attacker guesses 
the password and logs in as the user. Undoing root’s pass- 
word change affects the attacker’s login session, requiring 
one user input to confirm to the network manager that it’s 
safe to discard the attacker’s TCP connection. 


In summary, RETRO correctly repairs all six attack 
scenarios posed by Taser, requiring user input only in two 
cases: to re-execute the browser, and to confirm that it’s 
safe to drop the attacker’s login session. Taser requires 
application-specific policies to repair these attacks, and 
some attacks cannot be fully repaired under any policy. 
Taser’s policies also open up the system to false negatives, 
allowing an adversary to bypass Taser altogether. 
7.2 Technique effectiveness


In this subsection, we evaluate the effectiveness of 
RETRO’s specific techniques, including re-execution,
predicate checking, and refinement. 

Re-execution is key to preserving legitimate user ac- 
tions. As described in §7.1 and quantified in Figure 8, 
RETRO re-executes several processes and functions to pre- 
serve and repair legitimate changes. Without re-execution, 
RETRO would have to conservatively roll back any files 
touched by the process in question, much like Taser’s 
snapshot policy, which incurs false positives. 

Without predicates, RETRO would have to perform 
conservative dependency propagation in the dependency 
graph. As in Taser, dependencies on attack actions 
quickly propagate to most objects in the graph, requir- 
ing re-execution of almost every process. This leads 
to re-execution of sshd, which requires user assistance. 
Figure 8 shows that many of the objects repaired with- 
out predicates were not repaired with predicates enabled. 
Taser would roll back all of these objects (false positives). 
Thus, predicates are an important technique to minimize 
user input due to re-execution. 

Without refinement of actor and data objects, 
RETRO would incur false dependencies via /tmp and 
/etc/passwd. As Figure 8 shows, several functions 
(such as getpwnam) were re-executed in repairing from 
attacks. If RETRO were unable to re-execute just those
functions, it would have re-executed processes like sshd, 
forcing the network manager to request user input. Thus, 
refinement is important to minimizing user input due to 
false dependencies. 


7.3 Performance 


We evaluate RETRO’s performance costs in two ways. 
First, we consider costs of RETRO’s logging during nor- 
mal execution. To this end, we measure the CPU overhead 
and log size for several workloads. Figure 10 summarizes 
the results. We ran our experiments on a 2.8GHz Intel 
Core i7 system with 8 GB RAM running a 64-bit Linux 
2.6.35 kernel, with either one or two cores enabled. 

The worst-case workload for RETRO is a system that 
uses 100% of CPU time and spends most of its time com- 
municating between small processes. One such extreme 
workload is a system that continuously re-builds the Linux 
kernel; another example is an Apache server continuously 
serving small static files. For such systems, RETRO incurs
an 89-127% CPU overhead using a single core, and
generates about 100-150 GB of logs per day. A 2 TB 
disk ($100) can store two weeks of logs at this rate before 
having to garbage-collect older log entries. If a spare 
second core is available, and the application cannot take 
advantage of it, it can be used for logging, resulting in 
only 18-33% CPU overhead. 

For a more realistic application, such as a HotCRP [23] 
paper submission web site, RETRO incurs much less 
overhead, since HotCRP’s PHP code is relatively CPU- 
intensive. If we extrapolate the workload from the 30 
minutes before the SOSP 2007 deadline [44] to an entire 
day, HotCRP would incur 35% CPU overhead on a single 
core (and almost no overhead if an additional unused core 
were available), and use about 4 GB of log space per day. 
We believe that these are reasonable costs to pay to be 
able to recover integrity after a compromise of a paper 
submission web site. 
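
The single-core overheads and daily log volumes quoted above can be rechecked from the measurements in Figure 10. The sketch below is a back-of-the-envelope calculation under stated assumptions: each log size is taken to cover the single-core run with RETRO (557 seconds for the kernel build, 1 minute for the web server), the HotCRP log is taken to cover 30 minutes of real deadline traffic, and 1 GB is counted as 1,000 MB.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def overhead_from_runtime(baseline_sec: float, retro_sec: float) -> float:
    """CPU overhead in percent when the metric is running time."""
    return (retro_sec / baseline_sec - 1.0) * 100.0


def overhead_from_throughput(baseline_rate: float, retro_rate: float) -> float:
    """CPU overhead in percent when the metric is requests per second."""
    return (baseline_rate / retro_rate - 1.0) * 100.0


def log_gb_per_day(log_mb: float, period_sec: float) -> float:
    """Extrapolate a log of log_mb megabytes over period_sec to one day."""
    return log_mb / period_sec * SECONDS_PER_DAY / 1000.0


# Single-core overheads from Figure 10.
kernel_1core = overhead_from_runtime(295, 557)
web_1core = overhead_from_throughput(7260, 3195)
hotcrp_1core = overhead_from_throughput(20.4, 15.1)

# Daily log volume, under the assumptions stated in the lead-in.
kernel_log = log_gb_per_day(761, 557)
web_log = log_gb_per_day(98, 60)
hotcrp_log = log_gb_per_day(81, 30 * 60)

print(round(kernel_1core), round(web_1core), round(hotcrp_1core))
print(round(kernel_log), round(web_log), round(hotcrp_log, 1))
```

Under these assumptions the results fall within the quoted 89-127% overhead and 100-150 GB per day for the two worst-case workloads, and give about 35% overhead and roughly 4 GB per day for HotCRP.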

Second, we consider the time cost of repairing a sys- 
tem using RETRO after an attack. As Figure 8 illustrated, 
RETRO is often effective at repairing only a small subset 
of objects and actions in the action history graph, and for 
attacks that affect the entire system state, such as the sshd 
trojan, user input dominates repair costs. To illustrate the 
costs of repairing a subset of the action history graph, 
we measure the time taken by RETRO to repair from a 
micro-benchmark attack, where the adversary adds an 
extraneous line to a log file, which is subsequently mod- 
ified by a legitimate process. When only this attack is 
present in RETRO’s log (consisting of 10 process objects, 
126 file objects, and 399 system call actions), repair takes 
0.3 seconds. When this attack runs concurrently with a 
kernel build (as shown in Figure 10), repair of the attack 
takes 4.7 seconds (an order of magnitude longer), despite the fact that the
log is 10,000× larger. This shows that RETRO’s log indexing
makes repair time depend largely on the number
of affected objects, rather than the overall log size.


8 DISCUSSION AND FUTURE WORK 


An important assumption of RETRO is that the attacker 
does not compromise the kernel. Unfortunately, security 
vulnerabilities are periodically discovered in the Linux 
kernel [5, 6], making this assumption potentially danger- 
ous. One solution may be to use virtual machine based 
techniques [14, 21], although it is difficult to distinguish 
kernel objects after a kernel compromise. We plan to 
explore ways of reducing trust in future work. 

In our current prototype, if attackers compromise the 
kernel and obtain access to RETRO’s log files, they may 
be able to extract sensitive information, such as user pass- 
words or keys, that would not have been persistently 
stored on a system without RETRO. One possible so- 
lution may be to encrypt the log files and checkpoints, 


so that the administrator must reboot the system from a 
trusted CD and enter the password to initiate recovery. 

Our current prototype can only repair the effects of an 
attack on a single machine, and relies on compensating 
actions to repair external state. In future work, we plan 
to explore ways to extend automated repair to distributed 
systems, perhaps based on the ideas from [29, 42]. 

RETRO requires the system administrator to specify 
the initial intrusion point in order to undo the effects 
of the attack, and finding the initial intrusion point can 
be difficult. In future work, we hope to leverage the 
extensive data available in RETRO’s dependency graph 
to build intrusion detection tools that can better pin-point 
intrusions. Alternatively, instead of trying to pinpoint 
the attack, we may be able to use RETRO to retroactively 
apply security patches into the past, and re-execute any 
affected computations, thus eliminating any attacks that 
exploited the vulnerability in question. 

We did not have space to address several practical as- 
pects of using RETRO, such as performing multiple re- 
pairs or undoing a repair. These operations translate into 
making additional checkpoints, and updating the graph 
accordingly after repair. Also, as hinted at in §5, we plan 
to explore the use of more specialized repair managers, 
such as managers for a language runtime, a database, or 
an application like a web server or web browser. Finally, 
while RETRO’s performance and storage overheads are 
already acceptable for some workloads, we plan to further 
reduce them by not logging intermediate dependencies 
that can be reconstructed at repair time. 


9 CONCLUSION 


RETRO repairs system integrity from past attacks by using 
an action history graph to track system-wide dependen- 
cies, roll back affected objects, and re-execute legitimate 
actions affected by the attack. RETRO minimizes user 
input by avoiding re-execution whenever possible, and 
by using compensating actions for external dependencies. 
RETRO’s key techniques for minimizing re-execution in- 
clude predicates, refinement, and shepherded re-execution. 
A prototype of RETRO for Linux recovers from a mix of 
ten real-world and synthetic attacks, repairing all side- 
effects of the attack in all cases. Six attacks required no 
user input to repair, and RETRO required significant user 
input in only two cases involving trojaned network-facing 
applications. 
Static Checking of Dynamically-Varying Security Policies in
Database-Backed Applications

Abstract


We present a system for sound static checking of security 
policies for database-backed Web applications. Our tool 
checks a combination of access control and information 
flow policies, where the policies vary based on database 
contents. For instance, one or more database tables may 
represent an access control matrix, controlling who may 
read or write which cells of these and other tables. Us- 
ing symbolic evaluation and automated theorem-proving, 
our tool checks these policies statically, requiring no pro- 
gram annotations (beyond the policies themselves) and 
adding no run-time overhead. Specifications come in the 
form of SQL queries as policies: for instance, an appli- 
cation’s confidentiality policy is a fixed set of queries, 
whose results provide an upper bound on what infor- 
mation may be released to the user. To provide user- 
dependent policies, we allow queries to depend on what 
secrets the user knows. We have used our prototype im- 
plementation to check several programs representative of 
the data-centric Web applications that are common today. 


1 Introduction 


Much of today’s most important software exists as 
Web applications, and many of these applications are 
thin interface layers for relational databases. Real- 
world requirements impel developers to implement many 
application-specific schemes for access control (“who
can do what?”) and information flow (“who can learn
what?”). To reason about correctness of these implementations,
the programmer must consider all possible flows
of control through a program.

This task is hard enough if a security policy can be 
expressed statically, as, for instance, a list of which of 
a fixed set of principals is allowed to perform each of a 
fixed set of actions. However, the needs of real applica- 
tions tend to force use of evolving security policies, and 
usually the most convenient place to store a policy is in 


the same database where the rest of application data re- 
sides. For instance, a database often encodes some kind 
of access control matrix, where entries reference rows of 
other tables. The peculiar structure of an organization 
may require access control based on customized schema 
design and checking code. An effective security valida- 
tion tool must be able to “understand” these policies. 
Many program analysis and instrumentation schemes 
have been applied to provide some automatic assurance 
of security properties. In this space, the traditional di- 
chotomy is between dynamic and static tools, based on 
whether checking happens at run time or compile time. 
The two extremes have their characteristic advantages. 


• Dynamic analysis can often be implemented without
requiring any program annotations included
solely to make analysis easier.


• Real developers have an easier time writing specifications
compatible with dynamic analysis, since
these specifications can often be arbitrary code for
inspecting program states.


• Static analysis can provide strong guarantees that
hold for all possible program executions, even those
exercising weird corner cases that may not have
been considered.


• Static analysis adds no run-time overhead.


In this paper, we present UrFlow, a tool for static analysis
of database-backed Web applications. We have tried
to capture some of each of the advantages just described. Our
tool requires no program annotations and provides fully
sound static assurance about all possible executions of a
program, and it requires no changes to the run-time behavior
of programs. We take advantage of the fact that
it is already common for Web applications to be imple- 
mented at quite a high level, relying on an SQL engine 
to implement the key data structures. Our tool models 
the semantics of SQL faithfully, at a level that makes formal,
automated analysis quite practical. We use popular
ideas from symbolic execution and automated theorem- 
proving to build detailed models of program behavior 
automatically, which saves developers the trouble of ex- 
plaining these models with code annotations. 

It is natural for developers to write specifications that 
look much like the program code they are already writ- 
ing. Traditional assertions (e.g., with the C assert 
macro) fall under this heading. In an application that de- 
pends on an SQL engine to manage its main data struc- 
tures, it seems similarly natural to express security poli- 
cies using SQL. Our tool is based on that model, allowing 
developers to write detailed statically-checkable specifi- 
cations without learning a new language. Queries can 
express confidentiality properties by selecting which in- 
formation the user may learn, and queries can express 
database update properties by selecting allowable state 
transitions. We need only one extension to the standard 
SQL syntax and semantics: to allow policies to vary by 
user, we introduce explicit consideration of which secrets 
(e.g., passwords) the user knows. 

UrFlow is integrated with the compiler for Ur/Web [3], 
a domain-specific language for Web application develop- 
ment. Ur/Web presents a very high-level view of the do- 
main, with explicit language support for the key elements 
of Web applications. For instance, the SQL interface uses 
an expressive type system to ensure that any code that 
type-checks accesses the SQL database correctly. In the 
present project, we have used the first-class SQL support 
to avoid the need for program analysis to recover a high- 
level view of how an application uses the database. 

We begin by introducing our policy model and demon- 
strating its versatility. After that, we present our pro- 
gram analysis, including its symbolic evaluation and au- 
tomated theorem-proving aspects. Next, we discuss the 
scope and limitations of our analysis, describe some 
case-study applications that we have checked with Ur- 
Flow, and compare with related work. 


2 SQL Queries as Policies 


Consider a simple application that maintains a database 
of users and per-user secret strings. We can declare our 
schema to Ur/Web with table declarations. Following 
standard practice in relational databases, each table in- 
cludes a unique integer ID, which provides a convenient 
handle to pass to row-specific operations. Besides an ID, 
a user record contains a username and password, and 
a secret record contains the owning user ID and the 
data value. 


table user : { Id : int, Nam : string,
               Pass : string }
table secret : { Id : int, User : int,
                 Data : string }


We also declare an HTTP cookie, which acts like a 
typed global variable which exists separately on each 
Web browser. This cookie tracks the authentication in- 
formation for the currently logged-in user. While a more 
realistic program would probably rely on unique session 
IDs, here we adopt the less secure strategy of storing a 
user ID and password pair in each cookie, to simplify the 
example. 


cookie login : { Id : int, Pass : string }


We can write a function that checks this cookie and 
returns its user ID if the password is correct. The code 
is written in a functional style, where we collapse “ex- 
pressions” and “statements” into a single syntactic class. 
Thus, instead of determining the function return value 
with explicit return statements, we just say that the 
function result is the value of the single expression that 
is the function body. 

Ur/Web code makes a lot of use of tagged unions, a 
safe analogue to C unions that is popular in functional 
programming languages. A tagged union value is either a 
simple tag, which is like an enum value in C; or a pairing 
of a tag and another value, which is like a C union, but 
with a convention to ensure that it is always possible to 
inspect a value and determine which union alternative is 
being used. For tag T, a simple tag expression is written 
like T, while the pairing of that tag with expression e is 
written T(e). For instance, instead of allowing every 
object type to be inhabited by a special value null, we 
instead represent null with an explicit tag None, and 
we represent non-null object o as Some(o). A pattern-matching
construct case is used to deconstruct tagged
union values. 

Here is the code for a function to check the correct- 
ness of the information in the login cookie. It is writ- 
ten in a compiler intermediate language in which some 
higher-order functional programming idioms have been 
replaced with more standard imperative code. 


fun userId() =
  case getCookie(login) of
    None => None
  | Some(li) =>
    let b = query
      (SELECT COUNT(*) > 0 AS B
       FROM user
       WHERE user.Id = {li.Id}
         AND user.Pass = {li.Pass})
      (r acc => r.B) False in
    if b then
      Some(li.Id)
    else
      error("Wrong user ID or password!")

Our userId function begins by retrieving the current
value of the login cookie. This will either be None,
if no value of the cookie is set; or Some(li), if the
ID/password record li has been set as the cookie value.
If the cookie is not set, there is no user ID to return. Oth- 
erwise, we must consult the database to see if the pass- 
word is correct. 

We have literal SQL syntax embedded in the code, 
with splicing of variable values using curly braces. The 
query checks if there are any rows in the user table 
matching the cookie contents. In this intermediate lan- 
guage, every database read is expressed as a loop over 
the results of a query. The body of the loop is written as 
an expression with two explicitly-named new local vari- 
ables: r, the latest row to process; and acc, an accu- 
mulator that is modified as we process rows. The body 
expression after the => determines the new accumulator 
value after every iteration. We give False as the initial 
accumulator value. In our example here, the loop body 
ignores the accumulator, and we simply project the one 
field of any result row to save as the accumulator. The 
error function aborts program execution with an error 
message, which we do here when the user provides in- 
valid credentials. 
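
The query-as-loop form described above behaves like a left fold over the result rows. The following Python sketch is an illustration only: the helper name query, the representation of rows as dictionaries, and the sample row are assumptions made for the example, not part of Ur/Web.

```python
from typing import Callable, Iterable, TypeVar

Acc = TypeVar("Acc")

def query(rows: Iterable[dict], body: Callable[[dict, Acc], Acc], init: Acc) -> Acc:
    """Run `body` once per result row, threading an accumulator through."""
    acc = init
    for r in rows:
        # The value of the body expression becomes the new accumulator.
        acc = body(r, acc)
    return acc

# Hypothetical result of the COUNT(*) > 0 AS B query: a single row.
result_rows = [{"B": True}]

# The loop body ignores the accumulator and projects the field B.
b = query(result_rows, lambda r, acc: r["B"], False)
```

With no result rows, the fold simply returns the initial accumulator, which is why False is a sensible starting value in userId.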

We can write the main entry point of our application 
to display all of the logged-in user’s secrets. 


fun main() =
  case userId() of
    None => write("You’re not logged in.")
  | Some(u) =>
    query (SELECT secret.Id, secret.Data
           FROM secret
           WHERE secret.User = {u})
      (r acc =>
        write("<li> <i>");
        write(toString(r.Secret.Id));
        write("</i>: ");
        write(escape(r.Secret.Data));
        write("</li>")) ()


In this query loop, the accumulator is still ignored, and 
in fact we execute the function body solely for its side 
effects, which involve writing HTML to be sent to the 
client. 

We would like to verify that this application satisfies 
a reasonable confidentiality policy. Intuitively, every cell 
of the database belongs to a particular user. We want to 
ensure that no user is able to read cells belonging to a 
different user. This simple policy expresses our intent 
for the cells of the user table.


policy sendClient (SELECT *
                   FROM user
                   WHERE known(user.Pass))


The informal meaning of this policy is that the user 
may learn any value that could be returned from this 
query. In every policy statement, the policy keyword is
followed by a keyword naming the kind of policy. In this case,
that keyword is sendClient, which is used for confidentiality
policies. Specifically, the user may learn anything about any
row of user whose password he knows. The new pred- 
icate known models which information the client is al- 
ready aware of. We assume the client knows the text of 
the program and the text of the HTTP request it sent. In 
our example, when we disclose any secret information, 
we know that the user’s own password is known because 
it came from the login cookie, which was part of the 
incoming HTTP request. 

A more complicated policy allows the release of information
about secrets.


policy sendClient (SELECT *
                   FROM secret, user
                   WHERE secret.User = user.Id
                     AND known(user.Pass))


We use a join between the secret and user tables, 
requiring that the client demonstrate knowledge of the 
password for the user who owns the secret. 

Our tool verifies that the application satisfies these se- 
curity policies. That is, every cell of the database whose 
value might be disclosed could have been selected by one 
of these queries, based on an interpretation of known 
drawn from the HTTP request that prompted an execu- 
tion. 

There are several opportunities for mistakes in imple- 
menting the policy. Consider what would happen if we 
had implemented userId to always return 17. When 
we run the compiler, we get an error message. The com- 
piler tells us which secret may be leaked, and (in addition 
to the location of the offending write) we are given a first- 
order logic characterization of the state of the program at 
the time when the leak might occur. 


User learns: r.Secret.Data
Hypotheses: secret(x1),
  r = {Secret =
    {Id = x1.Id, Data = x1.Data}},
  x1.User = 17


The hypotheses are generated directly from the SQL 
query in main. The first hypothesis tells us that row x1 
is in the secret table. Our row variable r is equated 
with a record built by projecting the requested fields from 
x1, and the last hypothesis represents the WHERE clause. 

In the correct implementation, UrFlow explores every 
static path through the program, maintaining a logical 
state at each point. When the analysis reaches the point 
that triggered the error above, we have this more infor- 
mative state. 


c = cookie/login, known(c),
c = Some(c2), user(x1),
x1.Id = c2.Id, x1.Pass = c2.Pass,
secret(x2), x2.User = c2.Id,
r = {Secret = {Id = x2.Id, Data = x2.Data}}


The variable c stands for the cookie value, which is 
asserted to be known to the user. The SQL query from 
userId is reflected with assertions about a variable x1, 
which is the row of user that must have matched the 
query for execution to reach this point. The confiden- 
tiality policy used a join between secret and user to 
describe when information on secrets may be released. 
The program code, on the other hand, contains no joins. 
UrFlow understands join semantics to the point where it 
is able to deduce that the above logical state implies that 
a join, performed as in the policy, would authorize the 
release of everything included in the record r. 


2.1 What is Being Checked? 


We can give a simple characterization of exactly what 
confidentiality property the analyzer enforces, as a func- 
tion of the policy the user specifies. First, we need to 
define exactly what we mean by the known predicate. In- 
formally, a known piece of data is something that the user 
is already aware of, so that no confidentiality require- 
ment is violated by echoing back that value or another 
value derived from it in a predictable way. More for- 
mally, known is the most restrictive predicate satisfying 
the following rules: 


1. Any constant appearing in the program text is 
known. 


2. The initial value of every cookie is known. These 
cookies may have arbitrary structured types, as in 
the record type given to the login cookie in the
last example. 


3. The value of every explicit parameter to the appli- 
cation is known. For page requests generated by 
submission of HTML forms, this includes all form 
field values. 


4. A record is known iff all of its fields are known.


5. For any union tag T (e.g., Some in our example), a 
value v is known iff T(v) is known.


We say that a value v is allowed in a specific database 
state D if there exists a sendClient policy that, when 
executed in state D, would return v as one of its outputs. 
We say that a value v is built from a set S if v is in S 
or can be constructed out of the elements of S by com- 
bining a subset of them with record and tagged union 
operations. 
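
To make the two definitions concrete, the following Python sketch models values as nested tuples. It is an illustration under our own assumptions, not UrFlow's implementation: the helper names (rec, some, components, built_from) and the tuple encoding are hypothetical, and simple tags such as None are treated as always constructible.

```python
def rec(**fields):
    """Build a record value as a hashable tuple of (field, value) pairs."""
    return ("record", tuple(sorted(fields.items())))

def some(v):
    """Build the tagged union value Some(v)."""
    return ("tag", "Some", v)

NONE = ("tag", "None")  # a simple tag with no payload

def is_record(v):
    return isinstance(v, tuple) and len(v) == 2 and v[0] == "record"

def is_tagged(v):
    return isinstance(v, tuple) and len(v) == 3 and v[0] == "tag"

def components(v):
    """Rules 4 and 5, read downward: if v is known, so are its parts."""
    yield v
    if is_record(v):
        for _, field_value in v[1]:
            yield from components(field_value)
    elif is_tagged(v):
        yield from components(v[2])

def built_from(v, base):
    """True if v is in base or can be assembled from base with record
    and tagged union operations."""
    if v in base:
        return True
    if is_record(v):
        return all(built_from(fv, base) for _, fv in v[1])
    if is_tagged(v):
        return built_from(v[2], base)
    return v == NONE  # assumption: a bare tag needs no ingredients

# The login cookie from the running example is known (rule 2).
cookie = some(rec(Id=17, Pass="hunter2"))
known = set(components(cookie))
```

Here the password and user ID inside the cookie become known by projection, and a freshly assembled record such as rec(Pass="hunter2") is built from the known set, while any string absent from the cookie is not.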
Now we can give a concise description of exactly what
UrFlow checks. For any execution of a program that the 
analysis approved: 


1. Whenever a write command sends some value v 
to the client, v is built from the set of values that are 
known or allowed. 


2. Whenever the program branches based on the value 
v of some test expression, such that the branch cho- 
sen influences what might be sent to the client later, 
v is built from the set of values that are known or 
allowed. This prevents some implicit flows, where 
the very fact that a program reaches a particular line 
of code may reveal secret information. Since im- 
plicit flows are a notorious source of false alarms in 
information flow analysis, programmers might want 
to turn off this piece of checking, which would be 
easy to do via a compiler flag. 


The same kind of characterization does not work well 
for ruling out implicit flows induced by SQL WHERE 
clauses, so we leave additional checking of that kind for 
future work. This means that a checked program may 
leak information about the existence of rows, based on 
tests against arbitrary SQL expressions, but the contents 
of those rows will not be leaked directly. 





2.2 Authorizing Database Writes 


UrFlow also checks every database modification. For 
example, consider this page generation function, which 
would be given as the action to run upon submission of 
an HTML form for adding a new secret. 


fun addSecret(fields) =
  case userId() of
    None => write("You’re not logged in.")
  | Some(u) =>
    let id = nextId() in
    dml (INSERT INTO secret (Id, User, Data)
         VALUES ({id}, {u}, {fields.Data}));
    main()


If we do not assert an explicit database update policy,
then UrFlow rejects this program. Here is one policy that
would allow the insertion:


policy mayInsert (SELECT *
                  FROM secret AS New, user
                  WHERE New.User = user.Id
                    AND known(user.Pass)
                    AND known(New.Data))


We reuse the same SQL query notation for modification
policies, though the choice of SELECT clause is ignored,
so we will always write SELECT *. One of the
tables in the FROM clause must be given the name New;
this is the table for which we are authorizing insertion. 

UrFlow only allows a row insertion if the new row 
could be returned by one of the mayInsert queries, 
in a certain sense. In checking against a particular policy 
query, we interpret the New relation as the universal rela- 
tion, containing all possible tuples. The policy may join 
it with other, real database tables and perform filtering 
with WHERE, leading to a result set of rows that may be 
infinite. The insertion is permitted if the New part of one 
of these rows matches the values being inserted. 
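
The following Python sketch gives a concrete, run-time reading of the mayInsert policy above. It is only an illustration: UrFlow performs this check statically over logical states, and the function name, the dictionary row encoding, and the representation of known as a set of values are assumptions made for the example. Because New is the universal relation, there is nothing to enumerate for it; we simply test the candidate tuple against the WHERE clause, joined with the real user table.

```python
def may_insert_secret(new, users, known):
    """Does some row of `user` join with the candidate row `new` so that
    the policy's WHERE clause holds?"""
    return any(
        new["User"] == u["Id"]      # New.User = user.Id
        and u["Pass"] in known      # known(user.Pass)
        and new["Data"] in known    # known(New.Data)
        for u in users
    )

users = [
    {"Id": 1, "Nam": "alice", "Pass": "pa"},
    {"Id": 2, "Nam": "bob", "Pass": "pb"},
]
# Hypothetical request: the client sent alice's password and the text "hello".
known = {"pa", "hello"}

ok = may_insert_secret({"Id": 10, "User": 1, "Data": "hello"}, users, known)
wrong_owner = may_insert_secret({"Id": 11, "User": 2, "Data": "hello"}, users, known)
unknown_data = may_insert_secret({"Id": 12, "User": 1, "Data": "x"}, users, known)
```

The second insertion is refused because the client has not shown knowledge of bob's password, and the third because the data value is not known.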

Our insertion policy lets any user add secrets if he as- 
sociates them with his own user. We can also authorize 
deletions and updates, based on similar criteria.


policy mayDelete (SELECT *
                  FROM secret AS Old, user
                  WHERE Old.User = user.Id
                    AND known(user.Pass))

policy mayUpdate (SELECT *
                  FROM secret AS Old, secret AS New, user
                  WHERE Old.User = user.Id
                    AND New.User = Old.User
                    AND New.Id = Old.Id
                    AND known(user.Pass)
                    AND known(New.Data))


A mayDelete policy must tag a FROM table as Old,
to stand for the table being deleted from. A mayUpdate
policy needs both Old and New tables, standing for the
part of a table being updated and the new data being writ- 
ten into it. Both new policies retain the logic for checking 
that the client knows the password for the user whose se- 
cret is affected, and the update policy also requires that 
the secret ID is not changed. The insertion and update 
policies require that the new data value is known, which 
provides a simple guard against inadvertent leaking of 
privileged information into a part of the database that is 
considered to be less privileged. 


3 Flexibility of Query-Based Policies 


We have found that this approach to writing specifications
leads to concise descriptions of many natural policies.
For instance, we have implemented a simple Web
message forum system. Our implementation contains a 
table representing an access-control list. Each entry gives 
a user permissions in a specific forum, at a particular nu- 
meric level of access. 


table acl : { Forum : forumId, User : userId,
              Level : int }


One policy allows release of information about any 
message in a forum that the current user has been granted 
any kind of access to.


policy sendClient (SELECT *
                   FROM message, acl, user
                   WHERE acl.Forum = message.Forum
                     AND acl.User = user.Id
                     AND known(user.Pass))


Posting a new message requires access at level 2 or 
higher.


policy mayInsert (SELECT *
                  FROM message AS New, user, acl
                  WHERE New.User = user.Id
                    AND New.Forum = acl.Forum
                    AND user.Id = acl.User
                    AND known(user.Pass)
                    AND acl.Level >= 2
                    AND known(New.Subject)
                    AND known(New.Body))


Regular users may not delete messages from forums. 
This right is only granted to admins, who have access 
level 3 or higher. The following policy formalizes the 
deletion rule.


policy mayDelete (SELECT *
                  FROM message AS Old, user, acl
                  WHERE Old.Forum = acl.Forum
                    AND user.Id = acl.User
                    AND known(user.Pass)
                    AND acl.Level >= 3)


Our implementation allows forums to be marked as 
public, in which case any visitor may read their contents. 
There is also another ACL table which grants users ad- 
min access to all forums. Additional policies allow in- 
formation flows and updates based on these rules. 


The UrFlow policy language supports access control 
techniques besides user accounts with passwords. For 
example, we have implemented a simple Web-based poll 
system without user accounts. Anyone may create a new 
poll; at that time, the creator learns a secret code that 
grants admin rights to the poll. That code allows him to 
add poll questions. After adding all of the questions, the 
poll creator may mark the poll as live. After that time, 
no further changes to the poll are allowed, and the poll 
is added to a list on the application’s front page. Anyone 
may vote in a live poll, but no one may vote on a poll that 
is not yet live. After submitting his votes, a user receives 
a code that allows him to view the results of the poll. 
Results should never be released without first checking 
that the user has provided a code that matches the poll 
admin code or a code associated with a vote that has been 
cast. 

The policy below controls the conditions under which 
a new question may be added to a poll. In particular, 
the question must be linked to a valid poll, the user must 
know the admin code for the poll, and the poll must not 
be live yet. 


policy mayInsert (SELECT *
                  FROM question AS New, poll
                  WHERE New.Poll = poll.Id
                    AND known(poll.Code)
                    AND NOT poll.Live
                    AND known(New.Text))


Anyone with a poll’s admin code may update the poll 
only to mark it as live. This policy expresses that re- 
quirement with equality assertions between old and new 
values of every column besides Live.


policy mayUpdate (SELECT *
                  FROM poll AS New, poll AS Old
                  WHERE New.Id = Old.Id
                    AND New.Nam = Old.Nam
                    AND New.Code = Old.Code
                    AND New.Live
                    AND known(Old.Code))


We allow release of information about answers to a 
poll, whenever the user proves he already voted in that 
poll by providing a code associated with an appropriate 
answer set.


policy sendClient (SELECT *
                   FROM answer, answers AS Other,
                        answers AS Self
                   WHERE answer.Answers = Other.Id
                     AND Other.Poll = Self.Poll
                     AND known(Self.Code))


We believe that this specification approach is very 
general, while being much more accessible to the av- 
erage developer than most specification languages are. 
To investigate the potential for static analysis based on 
these specifications, we implemented the UrFlow pro- 
totype, which handles a restricted subset of all SQL 
queries. In particular, in both policies and programs, 
we only process queries containing just SELECT, FROM, 
and WHERE clauses, where the FROM clauses must be 
simple comma-separated lists of tables. We also have 
not implemented any analysis optimizations like proce- 
dure summaries [19], and the analysis only succeeds at 
understanding loops and recursion following a few sim- 
ple patterns. 

Perhaps surprisingly, this is enough to enable sound 
checking of a variety of paradigmatic Web applications. 
We will now describe the analysis and then argue for its 
effectiveness with statistics about a set of representative 
applications that it has validated. 

















4 An Outline of the Analysis 


Sound program checking requires considering all possible
paths of execution. Since almost any non-trivial Web
application can effectively follow infinitely many paths,
we must apply some abstraction. In implementing UrFlow,
we adopted the strategy associated with tools like
ESC [10], the Extended Static Checker family.

While concrete program evaluation involves program 
states consisting of variable values, memory states, and 
so on, the kind of symbolic evaluation that we apply 
involves program states consisting of formulas of first- 
order logic. Such a formula can be thought of as describ- 
ing concrete states, so that each abstract state may stand 
for infinitely many concrete states. Every basic program 
operation can be modeled as a predicate transformer. 
Some operations may not always be safe. In the classical 
setting, this may be an array dereference, where the in- 
dex might be out of bounds. In our case, possibly-unsafe 
operations include write commands and database up- 
dates. No matter which setting we are in, the safety of 
operations is checked by associating each operation with 
a logical condition that implies its safety. 

This gives us the outline of a sound checking procedure:
Start with the abstract state “true.” Explore all program
paths, extending the abstract state as we go. Each
time we reach an operation with safety condition $C$ while
in state $S$, ask an automated theorem prover whether
$S \Rightarrow C$. The ESC projects used the Simplify prover [8]
for this purpose. Today, the functionality provided by 
Simplify is most commonly known by the name SMT, 
for satisfiability modulo theories, and there is a rich base 
of tools and users in the domain of static program check- 
ing. 

Our outline omits a critical element of the problem: 
Even after abstracting program states with formulas, 
there are probably still infinitely many feasible program 
paths. The ESC approach requires additional program 
annotations that can be used to finitize the path space. In 
the design of UrFlow, we have instead taken advantage of 
the control-flow simplicity of the average Web applica- 
tion. Many interesting applications can be implemented 
with just one kind of loop: iteration over writing some 
output for every row returned by an SQL query. Such 
loops effect no state changes that must be taken into ac- 
count in the remainder of the program, so in a sense they 
have trivially inferable “loop invariants.” Since loop iter- 
ation does not accumulate side effects, it is sound to tra- 
verse each loop body just once, which ensures that each 
program can be broken into a finite set of finite analysis 
paths. 

UrFlow thus works by literal exploration of all con- 
trol flow paths through a program. The next section goes 
into more detail on the exploration strategy, pointing out 
the theorem prover operations that will be required. The 
following section presents our implementation of those 
prover primitives, in an engine that extends the standard 
SMT approach with a few new features. 


5 Symbolic Evaluation


The abstract states of UrFlow are defined in terms of a
simple language of logical expressions and predicates.
We write $c$ for constants (drawn from integer, floating
point, and string literals), $T$ for union tags, $x$ for logical
variables, $X$ for program variables, $F$ for record field
names, and $R$ for SQL table names. The following grammar
describes the syntax of program states. For a token
sequence $t$, we write $\overline{t}$ for a comma-separated list of zero
or more $t$s.


$$
\begin{array}{llcl}
\text{Expression} & e & ::= & c \mid x \mid T(e) \mid \{\overline{F = e}\} \mid e.F \\
\text{Predicate}  & p & ::= & \mathsf{known}(e) \mid R(e) \mid e = e \mid \ldots \\
\text{State}      & S & ::= & (\overline{p}, \overline{X \mapsto e})
\end{array}
$$


A state is a pair of a variable assignment and a set 
of predicates. For a particular program point, a variable 
assignment maps every in-scope program variable into a 
logical expression. The predicates are expressed only in 
terms of logical variables, not the program variables. 

Since we inline all function calls, every execution path 
to analyze begins at the entry point of some function 
that has been registered to be called in response to a 
particular URL pattern. The arguments to this function
stand for explicit parameters and form field values,
extracted from an HTTP request. Where the function
arguments are named $X_i$, we create an initial state
$(\overline{\mathsf{known}(x_i)}, \overline{X_i \mapsto x_i})$, for fresh, distinct variables $x_i$.
At many other points in path exploration, we will gen- 
erate fresh logical variables, which we always assume to 
be distinct from any previously-chosen variables. 

For each function, we explore all paths through it. 
Most program expression forms are easy to process, as 
they admit direct translation into logical expressions. 
The more interesting cases come from branching and 
database interaction. 

Our single branching construct is case expressions, 
which test a value against a number of patterns, which 
may bind new variables if they match. We model if 
expressions as a special case of case expressions, where 
the patterns to match against are true and false.

As an example, consider an expression like the follow- 
ing: 


case e of None => el | Some(X) => e2 


If e is just the tag None, then we continue with evaluating
e1. Otherwise, e is Some(v) for some v, and we
evaluate e2 with X set to v. To capture this with symbolic
evaluation, we consider both e1 and e2 as starts of
separate execution paths. For the e1 case, we extend the
state with the predicate v = None, where v is the result
of evaluating e. For the e2 case, we choose a fresh variable
x, add the variable mapping $X \mapsto x$, and add the
predicate v = Some(x).


With case, it is easy to write code with exponentially 
many control-flow paths, but where all but a few are log- 
ically impossible. For instance, we can sequence several 
case expressions that analyze the same program vari- 
able with the same patterns. Variables are immutable, so 
each case must choose the same pattern, reducing the 
number of feasible paths to the number of patterns. We 
want our automated theorem prover to detect the infeasibility
of the other paths as early as possible. Concretely,
this will happen on a path where two cases lead to assertions
like v = None and v = Some(x), on a path
that assumes matching of a None pattern the first time 
and a Some pattern the second time. The prover knows 
that values built with different union tags are disjoint, so 
it can signal a contradiction here. Whenever a contra- 
diction is detected at some point on a path, we can skip 
exploring the rest of that path. 
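
The pruning argument can be illustrated with a small Python sketch. This is a simplification under our own assumptions rather than UrFlow's exploration code: each case is reduced to a scrutinee name and a list of tags, and the only "theorem proving" performed is the disjointness check between different tags of the same immutable variable.

```python
def feasible_paths(cases):
    """cases: list of (scrutinee, [tag, ...]) in program order.
    Returns every path (tuple of chosen tags) that asserts no contradiction."""
    paths = []

    def explore(i, facts, chosen):
        if i == len(cases):
            paths.append(tuple(chosen))
            return
        var, tags = cases[i]
        for tag in tags:
            # Values built with different tags are disjoint, so a second
            # assertion about `var` with another tag is a contradiction.
            if facts.get(var, tag) != tag:
                continue  # skip the rest of this path
            explore(i + 1, {**facts, var: tag}, chosen + [tag])

    explore(0, {}, [])
    return paths

# Three sequenced case expressions over the same variable v.
same_variable = feasible_paths([("v", ["None", "Some"])] * 3)

# Two case expressions over unrelated variables.
different_variables = feasible_paths([("a", ["None", "Some"]),
                                      ("b", ["None", "Some"])])
```

Without pruning, the first program would have eight syntactic paths; with it, only the two consistent ones are explored to the end, matching the number of patterns.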


A number of primitive operations send output to the 
client. The simplest of these is write, which appends a 
piece of HTML to the page being generated. UrFlow en- 
forces that the value being sent can be constructed from 
known and allowable pieces of data. Recall that allow- 
able values are those that could be produced by execut- 
ing sendClient policies in the current database state. 
Consider this line of our earlier example program: 


write (escape (r.Secret.Data)); 


The record r has come out of a database query. To 
verify that this write conforms to the policy, we must 
check that r.Secret.Data is known, allowable, or 
built from such values out of record and union opera- 
tions. At this point in symbolic execution, the variable 
mapping will map the program variable r to some logi- 
cal variable r, and our predicate set will be: 


c = cookie/login, known(c), c = Some(c’), user(x1),
x1.Id = c’.Id, x1.Pass = c’.Pass,
secret(x2), x2.User = c’.Id,
r = {Secret = {Id = x2.Id, Data = x2.Data}}


The state tells us that we know of two rows that must 
exist in the database: x1 from table user and x2 from
table secret. Each of our declared confidentiality poli- 
cies is phrased as a SELECT query whose FROM clause 
mentions one or more tables. To check if a value may be 
written, we need to consider ways of matching the pol- 
icy queries with the logical state. The same table may 
be mentioned multiple times in one policy or one state, 
so, in general, there may be many ways to match a policy’s
FROM clause with the table predicates of a state. In
UrFlow, we apply the heuristic of considering at most 
one matching per policy. The analysis enumerates every 
matching of policies with row variables, subject to that 
constraint. 

Our running example included these two policies: 


policy sendClient (SELECT *
                   FROM user
                   WHERE known(user.Pass))

policy sendClient (SELECT *
                   FROM secret, user
                   WHERE secret.User = user.Id
                     AND known(user.Pass))


They can be expressed in logical form, where each is 
a set of predicates that, if all are true, implies the allowa- 
bility of a set of values. 





Predicates: user(r1), known(r1.Pass)
Values:     r1.Id, r1.Nam, r1.Pass

Predicates: user(r1), secret(r2), known(r1.Pass),
            r2.User = r1.Id
Values:     r1.Id, r1.Nam, r1.Pass, r2.Id,
            r2.User, r2.Data


Matching a policy against a state is a two-step process. 
First, we consider a mapping of the policy’s $r_i$ row variables
to variables appearing in the state. For any table
predicate $R(r_i)$ appearing in the policy, we try setting $r_i$
to $x$, for any $R(x)$ appearing in the state. Once we have
found a plausible mapping for every policy row variable, 
we apply that mapping to the remaining predicates in the 
policy. If the theorem prover verifies that the state im- 
plies every one of these predicates, then we have found a 
viable policy instantiation, and we can continue match- 
ing the remaining policies. We repeat the process to try 
every combination of instantiating every policy at most 
once. 
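
The two-step matching just described can be sketched in Python as follows. This is an illustration under stated assumptions, not UrFlow's code: the function instantiations, the string encoding of predicates, and the holds callback are hypothetical, and the callback is a simple lookup in a set of facts, which is far weaker than the theorem prover it stands in for.

```python
from itertools import product

def instantiations(policy_rows, state_rows, holds):
    """policy_rows: [(table, policy_var)]; state_rows: [(table, state_var)].
    Yield each mapping of policy row variables to state row variables that
    respects table names and whose remaining predicates are implied."""
    # Step 1: for each policy row variable, the state variables of the same table.
    candidates = [[sv for (t, sv) in state_rows if t == table]
                  for (table, _) in policy_rows]
    for choice in product(*candidates):
        mapping = {pv: sv for (_, pv), sv in zip(policy_rows, choice)}
        # Step 2: ask the (stand-in) prover about the remaining predicates.
        if holds(mapping):
            yield mapping

# Facts assumed to be implied by the state of the running example.
facts = {("known", "x1.Pass"), ("eq", "x2.User", "x1.Id")}

# Remaining predicates of the second policy, as templates over r1 and r2.
remaining = [("known", "{r1}.Pass"), ("eq", "{r2}.User", "{r1}.Id")]

def holds(mapping):
    return all(tuple(part.format(**mapping) for part in pred) in facts
               for pred in remaining)

policy_rows = [("user", "r1"), ("secret", "r2")]
state_rows = [("user", "x1"), ("secret", "x2"), ("secret", "x3")]
found = list(instantiations(policy_rows, state_rows, holds))
```

The extra row variable x3 is tried as a candidate for r2 but rejected, because nothing in the assumed facts relates x3.User to x1.Id.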

For every set of policy instantiations, we compute the 
set of expressions that those policies say are fair game 
to write. Our running example has exactly one feasible 
instantiation per policy: every policy variable in user 
unifies with x1, and every policy variable in secret 
unifies with x2. The remaining predicates are all implied 
by the state. Most interestingly, we must verify that the 
state implies known(x1.Pass), which follows by reason- 
ing from this subset of the state predicates: 


known(c), c = Some(c’), x1.Pass = c’.Pass


The reasoning goes like this: Because the union value 
c is known, its contents c’ are known, too. Because the 
record c’ is known, its field Pass is known. That field is 
asserted equal to the value x1.Pass that we want to prove 
known, so we are done. The theorem prover provides a 
complete decision procedure for reasoning chains of this 
kind. 

Having verified correct instantiation of each policy, we 
arrive at this set of allowable expressions: 


x1.Id, x1.Nam, x1.Pass, x2.Id, x2.User, x2.Data


9th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’10) 


We are trying to prove that the expression 
r.Secret.Data is allowable, which requires proving 
that it is equal to one of the above expressions. It turns 
out that our state implies that the written value equals 
x2.Data, because the state contains this predicate:


r = {Secret = {Id = x2.Id, Data = x2.Data}}


That completes the check for this write operation. 
The procedure scales to handling much more compli- 
cated cases, and we also apply the same procedure to any 
expression used in a branching construct, such that the 
result of the test influences what is written to the client. 
Especially in this latter case, we need to be able to rea- 
son about values that are neither known nor allowable, 
but that are built from such values via record and union 
operations. Our theorem prover handles the automation 
of that kind of reasoning, too. 


The heart of symbolic evaluation is the treatment of 
database queries. Recall the form of queries, as illus- 
trated by the main output loop of our example applica- 
tion. 


























query (SELECT secret.Id, secret.Data 
FROM secret 
WHERE secret.User = {u}) 
(r acc => ...) () 


We execute an SQL query, which may contain injected 
program values, and loop over the result rows. An accu- 
mulator is initialized to some specified value, which here 
is the dummy value (), since we execute this loop body 
only for side effects. Every iteration runs the loop body 
with r bound to the latest result row and acc bound to 
the current accumulator. After an iteration, the accumulator
is replaced with the value of the ... body expression.

Traditional verification tools require manual annota- 
tion of loops with invariants, to help tame the unde- 
cidability of the program analysis problem. To avoid 
that cost, we designed UrFlow around some observations 
about the loops that appear in practice in Web applica- 
tions. Most are run solely for their side effects of writ- 
ing content to the client, so that there is no need to track 
state changes from iteration to iteration. Ur/Web vari- 
ables are all immutable, so it is not even possible for 
them to change across iterations. Side effects are re- 
stricted to database tables and cookies, which tend not 
to be used in the same way that variables are used in tra- 
ditional imperative languages. All this implies that a sim- 
ple loop traversal strategy can be very effective: traverse 
each loop body only once. 

Concretely, when we reach a query in a symbolic ex- 
ecution path, we consider two possible sub-paths. First, 
the query may return no results, in which case we proceed
taking the initial accumulator as the final value.

More interestingly, the loop may execute one or more
times. We perform a quick linear pass over the body ...
to see which cookies it might set and which tables
it might modify with SQL UPDATE or DELETE commands.
All references to those cookies and tables are
deleted from the symbolic state. Since all other aspects 
of concrete state are immutable, this new logical state is 
guaranteed to be an accurate description of the concrete 
state at the beginning of any iteration of the loop. Thus, 
by running the loop body with its local variables set to 
fresh logical variables, we consider all possible behav- 
iors of the loop. We can continue execution afterward as 
if we had just executed the loop body once as normal, 
non-loop code. The symbolic state at loop exit can just 
as well stand for the last iteration of the loop as for any 
other iteration. 
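
The single-pass loop strategy can be sketched like this. It is a simplified illustration under stated assumptions: the symbolic state is modeled as a list of (predicate, resources) pairs, where the resource set names the cookies and tables a predicate mentions, and the names `forget`, `query_paths`, and `fresh_acc` are invented for the example.

```python
def forget(state, modified):
    """Drop every predicate mentioning a cookie or table the body may modify."""
    return [(pred, res) for (pred, res) in state if not (res & modified)]

def query_paths(state, body_effects, init_acc):
    """Return the two symbolic sub-paths considered when reaching a query."""
    # Sub-path 1: no result rows, so the initial accumulator is the final value.
    no_rows = (list(state), init_acc)
    # Sub-path 2: one or more rows. The body is analyzed once, starting from
    # a weakened state that is valid at the start of any iteration.
    some_rows = (forget(state, body_effects), "fresh_acc")
    return [no_rows, some_rows]

# Hypothetical state: one fact per resource; the body may UPDATE table secret.
state = [
    ("secret(x2)", {"secret"}),
    ("known(c)", {"cookie/login"}),
    ("user(x1)", {"user"}),
]
paths = query_paths(state, {"secret"}, "()")
```

In the second sub-path only the facts about the login cookie and the user table survive, which is what lets one traversal of the body stand for every iteration.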

At the beginning of a loop iteration, we must enrich 
the logical state with predicates capturing the behavior of 
the query. This is best illustrated by example. Consider 
again the main loop of our example application. We execute
its loop body with program variable r set to logical variable r
and acc set to some arbitrary value (since the accumulator is
not referenced in the body). Assume that program variable u is
mapped to logical variable u. We add these predicates to 
the logical state: 














secret(x2), x2.User = u,
r = {Secret = {Id = x2.Id, Data = x2.Data}}


Queries with joins just add more table predicates, as 
we have seen in the modeling of queries as policies. 
Larger WHERE conditions add additional non-table pred- 
icates. A SELECT clause determines which fields to 
project from the tables, in building the record expression 
to equate with r. 
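
As a rough sketch of how a query might be turned into state predicates, the following function builds the three kinds of facts described above from a simplified query description. The query encoding, the function name `query_predicates`, and the string form of the predicates are assumptions for illustration only; UrFlow works on Ur/Web's typed SQL syntax rather than on tuples like these.

```python
def query_predicates(tables, where, select, fresh, row_var="r"):
    """Build logical predicates for one loop iteration over a query.

    tables: table names in the FROM clause
    where:  (table, column, logical_expr) equalities from the WHERE clause
    select: (table, column) pairs from the SELECT clause
    fresh:  iterator supplying fresh logical row variables
    """
    rows = {t: next(fresh) for t in tables}
    # One table predicate per table in FROM (joins simply add more).
    preds = [f"{t}({x})" for t, x in rows.items()]
    # WHERE conditions become non-table predicates.
    preds += [f"{rows[t]}.{col} = {rhs}" for (t, col, rhs) in where]
    # SELECT determines the record expression equated with the row variable.
    fields = []
    for t in tables:
        cols = [c for (tbl, c) in select if tbl == t]
        if cols:
            inner = ", ".join(f"{c} = {rows[t]}.{c}" for c in cols)
            fields.append(t.capitalize() + " = {" + inner + "}")
    preds.append(row_var + " = {" + ", ".join(fields) + "}")
    return preds

# The main loop of the running example, with u as the logical variable
# for program variable u and x2 as the fresh row variable.
preds = query_predicates(
    tables=["secret"],
    where=[("secret", "User", "u")],
    select=[("secret", "Id"), ("secret", "Data")],
    fresh=iter(["x2"]),
)
```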

This basic algorithm works for most of the queries that 
we support. In general, UrFlow does not yet support SQL 
grouping or aggregation. We include one special case for 
queries selecting just the aggregate function COUNT(*).
Here, we consider that the loop body always iterates ex- 
actly once. Either the query result is 0, and we do not 
enrich the state with any new table information; or the 
result is greater than 0, and we assert that there exists 
some set of rows matching the conditions of the query. 

To check database updates, we use a hybrid of the 
query and write checking. Any modification must match 
with an update policy, using the same matching proce- 
dure as for writes, but without the need to check al- 
lowability of a value. After an UPDATE or DELETE, 
we delete any state predicates mentioning the affected 
tables. 

Figure 1: E-graph for the state from the write example

UrFlow also has basic support for simple recursive
functions. Calls to recursive functions are effectively inlined
like regular function calls, with further self-calls
skipped. To make this omission sound, we analyze each 
recursive function to find all effects it might have on the 
database and cookies, and every self-call is treated as a 
nondeterministic modification of those parts of the state, 
followed by generation of an unknown return value. Fur- 
ther analysis allows us to abstract the initial state so that 
it can stand for any set of arguments that might be used at 
any recursion depth, such that we only preserve state in- 
formation that can be shown not to vary across calls. As 
a result, just like for query loops, a single pass over the 
function body suffices to consider all possible behaviors. 

We want to emphasize some useful consequences of 
the way that our analysis handles SQL. First, unlike in 
some related work [14], despite the fact that our poli- 
cies are themselves SQL queries, the analysis does not 
require that program code use exactly those queries. Se- 
mantic modeling of queries makes it possible for one 
policy query to justify infinitely many possible program 
queries. Second, the soundness of our analysis depends 
on knowledge of the database schema, but not knowl- 
edge of database contents. Schema changes can invali- 
date analysis results by, for example, redefining data in- 
tegrity constraints that the theorem-prover might have re- 
lied on. However, arbitrary changes to the database rows, 
by arbitrary programs with no relation to UrFlow, cannot 
invalidate past analysis results. 


6 The Theorem Prover 


The last section highlighted the key theorem-prover op- 
erations that symbolic evaluation depends on. We can 
summarize them like this: 


• Assert a predicate p. If p contradicts the predicates
already asserted, raise an exception indicating so.

• Check if a predicate is implied by those already
asserted.

• Determine if a logical expression can be constructed
from members of a set of allowable expressions.




The first two points are supported by the classic model 
of first-order logic theorem-proving that is embodied in 
tools like Simplify [8]. The third point is new and not di- 
rectly supported by usual prover interfaces, but the usual 
implementation techniques can support it very directly. 

Provers like Simplify are based on the Nelson-Oppen 
architecture. We do not use many of the elements of that 
architecture, since our prototype implementation omits 
features like reasoning about arithmetic. Instead, we just 
adopt the key data structure, the E-graph. An E-graph 
is a directed graph representation of the possible worlds 
that are consistent with a set of predicates. Nodes stand 
for objects, and, for function symbol f, an edge labeled 
with f goes from node u to node v if, in any compatible
world, the object associated with v equals the result of 
applying f to the object associated with u. A node is 
labeled with logical variables and constants to indicate 
that any compatible world must assign this node to an 
object equal to those variables and constants. 

In UrFlow, we only use two kinds of function symbols:
union tags and record field names. For tag T, there
is a T-labeled edge from u to v if v must be u tagged
with T (i.e., “v = T(u)”). For field name F, there is
an F-labeled edge from u to v if u is a record whose F
component equals v. For each node that came from a literal
record expression, we mark that node as complete,
in the sense that the field edges coming out of it provide 
a complete description of the available fields. An exam- 
ple of an incomplete record node is one representing a 
row selected in an SQL query; the state will only men- 
tion those columns relevant to the query, and it would be 
unsound to treat this row as if it had no further columns. 

Figure 1 shows an E-graph representing the 
logical state given earlier for checking the code 
write (escape (r.Secret.Data)). Nodes are 
boxes when the state implies that they are known; other 
nodes may not be known. Complete record nodes are 
diamonds. We abbreviate cookie/login as C.

The basic prover algorithm understands two kinds of
predicates: e1 = e2 and known(e). When either kind is
asserted, its expressions are first evaluated into nodes of
the E-graph, adding new nodes as necessary. A variable
or constant is evaluated to the node labeled with it. A
union tag application T(e) is evaluated by following the
T edge from the node that e evaluates to, and a field
projection e.F is evaluated analogously. A record expression
{F1 = e1, ..., Fn = en} is evaluated by checking
for existing complete nodes whose F_i edges point to the
nodes to which the e_i evaluate.

When a fact e1 = e2 is asserted, the nodes u1 and u2
standing for e1 and e2 are merged, taking the unions of
their sets of labels and incoming and outgoing edges.
Alternatively, this fact might trigger a contradiction. That
happens when u1 and u2 are labeled with different constants
or have incoming tag edges labeled with different
tags.

When a fact known(e) is asserted, and e evaluates to 
u, we “change u to a box,” and we propagate this known- 
ness information across edges. That propagation follows 
record field edges in the forward direction only and tag 
edges in either direction. The same propagation is im- 
plied when merging a known node with a not-known 
node for an equality assertion. 


The heart of the procedure is in this handling of assertion.
E-graphs have nice properties which make implication
checking very efficient. To check if e1 = e2,
we only check if e1 and e2 evaluate to the same node.
To check if known(e), we only check if e evaluates to a 
boxed node. 
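
To make the assertion and checking steps concrete, here is a toy E-graph in Python. It is a sketch under simplifying assumptions, not the paper's Standard ML prover: constants, record expressions, and complete nodes are omitted, the class and method names are invented, and edges are renormalized by a naive fixpoint after every merge.

```python
class EGraph:
    """Toy E-graph over variables, union tags, and record fields."""

    def __init__(self):
        self.parent = []     # union-find forest over node ids
        self.var_node = {}   # variable name -> node id
        self.edges = {}      # (source node, kind, label) -> target node
        self.known = set()   # nodes the state implies are known ("boxes")

    def find(self, n):
        while self.parent[n] != n:
            n = self.parent[n]
        return n

    def _new(self):
        self.parent.append(len(self.parent))
        return len(self.parent) - 1

    def eval(self, e):
        """Evaluate an expression to its node, adding nodes as necessary.

        Expressions: ("var", x), ("tag", T, e) for T(e), ("field", F, e) for e.F.
        """
        if e[0] == "var":
            if e[1] not in self.var_node:
                self.var_node[e[1]] = self._new()
            return self.find(self.var_node[e[1]])
        kind, label, sub = e
        key = (self.eval(sub), kind, label)
        if key not in self.edges:
            self.edges[key] = self._new()
        return self.find(self.edges[key])

    def _rebuild(self):
        """Re-key edges by representative; equal edges must share a target."""
        changed = True
        while changed:
            changed = False
            fresh = {}
            for (src, kind, label), dst in self.edges.items():
                key, dst = (self.find(src), kind, label), self.find(dst)
                if key not in fresh:
                    fresh[key] = dst
                elif self.find(fresh[key]) != dst:
                    self.parent[dst] = self.find(fresh[key])
                    changed = True
            self.edges = fresh

    def _propagate(self):
        """Spread knownness forward over any edge, backward over tag edges only."""
        self.known = {self.find(n) for n in self.known}
        changed = True
        while changed:
            changed = False
            for (src, kind, _), dst in self.edges.items():
                if src in self.known and dst not in self.known:
                    self.known.add(dst)
                    changed = True
                if kind == "tag" and dst in self.known and src not in self.known:
                    self.known.add(src)
                    changed = True

    def assert_eq(self, e1, e2):
        a, b = self.eval(e1), self.eval(e2)
        if a != b:
            self.parent[b] = a   # merge the two nodes
        self._rebuild()
        tags = {}
        for (_, kind, label), dst in self.edges.items():
            if kind == "tag" and tags.setdefault(dst, label) != label:
                raise ValueError("contradiction: node has two different tags")
        self._propagate()

    def assert_known(self, e):
        self.known.add(self.eval(e))
        self._propagate()

    def implies_eq(self, e1, e2):
        return self.eval(e1) == self.eval(e2)

    def implies_known(self, e):
        n = self.eval(e)
        self._propagate()
        return n in self.known

# The state subset from the write example:
# known(c), c = Some(c'), x1.Pass = c'.Pass
g = EGraph()
c, c1, x1 = ("var", "c"), ("var", "c'"), ("var", "x1")
g.assert_known(c)
g.assert_eq(c, ("tag", "Some", c1))
g.assert_eq(("field", "Pass", x1), ("field", "Pass", c1))
pass_is_known = g.implies_known(("field", "Pass", x1))
```

After the three assertions, the node for x1.Pass is boxed, while the node for x1 itself is not, because knownness never flows backward across a field edge.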

One useful addition, implemented outside of the theo- 
rem prover core, takes advantage of key information for 
SQL tables, where, for instance, an ID column is as- 
serted not to be duplicated across rows of a table, and 
the SQL engine maintains this invariant with dynamic 
checks. Whenever a new predicate asserts that some row 
r is in table R, we check, for every pre-existing predicate 
R(r’), if r and r’ agree on the values of R’s key columns. 
These checks can be implemented by querying the prover 
core with the appropriate equality predicates. Whenever 
a matching r and r’ pair is found, we can skip adding the 
new predicate R(r) to the state, instead asserting r = r’. 
This enrichment of the prover is useful in analyzing ap- 
plications that, for example, query a user/password table 
multiple times, where correctness relies on the fact that 
the query always returns the same result. 
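
A minimal sketch of this key-based enrichment follows. The function name `add_table_fact` and the two callbacks standing in for the prover core are hypothetical; the toy prover used in the example knows only a fixed set of equalities.

```python
def add_table_fact(rows, table, r, key_cols, implies_eq, assert_eq):
    """Record that row variable r is in `table`, reusing an existing row
    when the prover shows the two agree on every key column.

    rows: dict mapping a table name to row variables already known to be in it
    """
    for old in rows.get(table, []):
        if all(implies_eq(f"{r}.{k}", f"{old}.{k}") for k in key_cols):
            assert_eq(r, old)   # same key, hence the same row: skip adding R(r)
            return old
    rows.setdefault(table, []).append(r)
    return r

# Toy prover: a fixed set of known equalities, plus a log of new assertions.
equalities = {frozenset(("x3.Id", "x1.Id"))}
asserted = []
implies = lambda a, b: a == b or frozenset((a, b)) in equalities
record = lambda a, b: asserted.append((a, b))

rows = {"user": ["x1"]}
first = add_table_fact(rows, "user", "x3", ["Id"], implies, record)
second = add_table_fact(rows, "user", "x4", ["Id"], implies, record)
```

The first call finds that x3 and x1 share an Id and asserts x3 = x1 instead of adding a second user fact; the second call has no matching row and adds user(x4) as usual.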

The last ingredient is checking if the value of expression
e can be constructed out of the values of expressions
e1, ..., en, using only record and union operations. To
implement the check, we evaluate each e_i in turn, marking
its node as allowable. Next, we evaluate e to a node
u. If u is marked as allowable, we are done. Otherwise,
if u has an incoming union tag edge from a node v, we re- 
peat the procedure for v. If u is a complete record node, 
we repeat the procedure for each target of a field edge 
out of u, returning success only if the check is successful 
for each of these new nodes. In any other case, we return 
failure. 
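
The recursive check can be sketched over an already-built graph. The encoding is an assumption made for the example: nodes are plain strings, `tag_source` maps a node u to the node v with a tag edge from v to u, and `complete_fields` lists the field-edge targets of each complete record node. The sketch also assumes the graph is acyclic.

```python
def constructible(u, allowable, tag_source, complete_fields):
    """Can node u be built from allowable nodes using only record and
    union operations?"""
    if u in allowable:
        return True
    if u in tag_source:
        # u = T(v): it suffices to construct v.
        return constructible(tag_source[u], allowable, tag_source, complete_fields)
    if u in complete_fields:
        # A complete record is constructible only if every field is.
        return all(
            constructible(t, allowable, tag_source, complete_fields)
            for t in complete_fields[u]
        )
    return False

# Write example: r = {Secret = {Id = x2.Id, Data = x2.Data}}
allowable = {"x1.Id", "x1.Nam", "x1.Pass", "x2.Id", "x2.User", "x2.Data"}
complete_fields = {
    "r": ["r.Secret"],
    "r.Secret": ["x2.Id", "x2.Data"],
    "leaky": ["x2.Data", "hidden"],   # hypothetical record with a disallowed field
}
tag_source = {"Some(x2.Data)": "x2.Data"}

ok_field = constructible("x2.Data", allowable, tag_source, complete_fields)
ok_record = constructible("r", allowable, tag_source, complete_fields)
ok_tagged = constructible("Some(x2.Data)", allowable, tag_source, complete_fields)
bad_record = constructible("leaky", allowable, tag_source, complete_fields)
```

Incomplete record nodes, such as rows selected by a query, never appear in `complete_fields`, so they are rejected unless they are allowable themselves.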


7 Discussion 


We can get a sense for the breadth of UrFlow by considering
how it helps with the most common Web application
security flaws. The OWASP Top 10 Web Application
Security Risks project¹ is a popular reference for
security-conscious Web developers. Based on analysis
of databases of real vulnerabilities, the OWASP team has
identified which classes of security flaw pose the greatest
risks. The Ur/Web compiler rules out injection (ranked
#1) and cross-site scripting (#2) vulnerabilities and partially
mitigates cross-site request forgery (#5) and unvalidated
redirects and forwards (#10) using techniques unrelated
to UrFlow. Risk #6, security misconfiguration, is
a whole-system property that cannot really be addressed
by any single tool, and UrFlow’s lack of integrated reasoning
about cryptography prevents it from helping to
avoid insecure cryptographic storage (#7). UrFlow can
contribute to the mitigation of the remaining risk categories.

¹ http://www.owasp.org/index.php/Category:OWASP_Top_Ten_Project


Risk #3, broken authentication and session manage- 
ment, is helped by the ability to use UrFlow policies to 
specify exactly which secure tokens may be sent to which 
clients. It is still possible to make mistakes in the poli- 
cies, but these policies should be significantly easier to 
audit than programs, with the many possible control-flow 
paths of the latter. The next two risk categories, insecure 
direct object references (#4) and failure to restrict URL 
access (#8), are very similar, as both involve the omission 
of access control checks for particular system objects. 
UrFlow can enforce that appropriate checks are always 
performed whenever database objects are used in par- 
ticular ways. Insufficient transport layer protection (#9) 
could be avoided by adding a variant of sendClient 
policies which specifies values that may only be sent to 
clients over SSL connections. 

Comparing against the pros and cons of security 
types [16], we find some interesting trade-offs. UrFlow 
uses high-level knowledge of programs to provide more 
sound reasoning without program annotations. Security- 
typed languages generally rely on declassification tech- 
niques where trust is granted to particular spans of code. 
This creates a contrast between the security-typed ap- 
proach, requiring trusted code but granting soundness 
with respect to implicit flows; and the UrFlow approach, 
which requires no trusted Ur/Web functions but ignores 
some implicit flows. Security type annotations tend to be 
required throughout a program, while UrFlow avoids the 
need to mark up program code. However, SQL queries as 
policies involve some gotchas that would be less applica- 
ble to security types. For instance, it is easy to forget all 
or part of a policy WHERE clause, which has the unfortu- 
nate consequence of allowing behaviors by default. 





The problem of implicit flow checking is a serious one 
in all kinds of information flow analysis. Where Ur- 
Flow checks implicit flows, the checking is not particu- 
larly clever, and implicit flows caused by WHERE clauses 
are ignored. Future work may be able to plug part of 
this hole statically, and we suspect there will also be a 
large role for dynamic monitoring systems, for detecting 
brute-force password cracking attempts and other attacks
that involve many HTTP requests.

Many different logical languages have been used for 
specification-writing in static verification tools. We 
found SQL to be a convenient choice, because it is ex- 
pressive enough to allow direct expression of interesting 
policies, and declarative enough to enable effective auto- 
mated reasoning. We do not mean to claim that SQL 
has great expressivity or succinctness advantages over 
more traditional specification languages. Rather, most 
Web programmers are accustomed to SQL, which should 
help in overcoming some of the social obstacles faced in 
the past by attempts to get programmers to write logical 
specifications. 

Our implementation today only handles a subset of the 
common SQL features. We omit support for outer joins. 
These should be easy to model via disjunctive formulas, 
covering all the possible cases of whether a row match- 
ing the join condition exists in a table, though a naive 
realization of this idea would probably have poor perfor- 
mance consequences for the theorem-prover. Grouping 
and aggregation are harder to encode in the quantifier- 
free first-order logic that we are employing. We sus- 
pect that most real programs can be checked with con- 
servative encodings of aggregation, where we model ag- 
gregate function values as unknowns. Alternatively, we 
can restrict reasoning about aggregate functions to sim- 
ple syntactic pattern-matching against policies. That ap- 
proach also seems most practical for handling of the SQL 
EXCEPT operator, which implements a kind of negative 
reasoning about which rows do not exist. This is needed 
to write down policies like (for a conference manage- 
ment system) “reviewer A may see the reviews for paper 
B only if A does not have a conflict with B.” 

More advanced policies might also need to include
non-trivial program code. For instance, a custom hashing
or encryption scheme might be used. Here we encounter
a common situation for static verification, where
it is always possible to expand the reach of your theorem-prover
to handle new program features. No single implementation
will ever be able to handle all realistic programs,
but we suspect that very good coverage will be
possible, after the incorporation of significant practical
experience with the tool.














8 Evaluation 


The UrFlow prototype is implemented in about 2200 
lines of Standard ML code. We have used the analy- 
sis to check a number of Ur/Web applications. There 
is a live demo of the applications, with links to syntax- 
highlighted source code, at: 


http://www.impredicative.com/ur/scdv/
Application   Program (LoC)   Policies (LoC)   Check (sec)

Secret             138              24            0.02
Poll               196              50            0.035
User DB             84               8            -
Calendar           255              46            0.28
Forum              412             134           17.68
Gradebook          342              61            1.49

















Figure 2: Lines-of-code breakdown in case studies, with 
time required to check the code with UrFlow 


Our case studies include Secret, a minimal applica- 
tion for storing secrets that may later be retrieved via 
password authentication, which was used as the model 
for this paper’s first set of running examples; the Forum 
and Poll applications from which Section 3’s examples 
were drawn; a Gradebook application, for managing a 
database of student grades in courses; and a reimplemen- 
tation of the Calendar application from the paper [5] that 
introduced the SIF system for combined static and dy- 
namic checking of information flow in Web applications. 
Calendar, Forum, and Gradebook share a common user 
authentication component. 


The Calendar application lets users save details of 
their schedules on the Web, with controlled sharing of in- 
formation. By default, no one may learn anything about 
an event. The creator of an event may learn everything 
about it, and the creator may add invitees who inherit 
the same read privileges. The creator may also authorize 
users to know only the time of an event, so that those 
users see that time slot only as “busy” on the creator’s 
calendar. Only event creators may modify any state re- 
lated to their events. 

The Gradebook application is based on a database of 
courses and assignments of users to be instructors, teach- 
ing assistants (TAs), or students in courses. Each student 
membership record contains an optional grade. Only sys- 
tem administrators may create courses and modify in- 
structor lists. Instructors may set grades and control TA 
assignments. A TA may view all of the state associated 
with a course, but may not modify it. A student may view 
his own grades, and a student in a course may only affect 
that course’s part of the database by dropping the course. 

Figure 2 gives the number of lines of code in each
of these components. An application’s code is sepa- 
rated into the program itself and the policies. The fig- 
ures here make “policy overhead” appear bigger than it 
would probably be in production applications, since our 
case studies include minimal code dedicated to provid- 
ing fancy user interfaces. Still, these numbers compare 
favorably to those for systems like SIF, where Calendar 
requires 1779 lines of code. While we have a similar
ratio of program to annotation, our annotations are of a 
different kind. 443 lines of the SIF version include an- 
notations, in the form of security types [20] and explicit 
downgradings. The latter involve annotations that effec- 
tively say “the owner of a piece of information trusts this 
span of code, so let that span release derived information 
that would not otherwise be allowed.” The SIF Calendar 
case study includes 17 such downgrades. 


The UrFlow approach is very different. As no annota- 
tions are required in programs, there is no need to accept 
any part of a program as trusted. All checking is with 
respect to the declarative specification provided by the 
policy queries. 


Our analysis detects flaws similar to those that occur 
frequently in real deployed systems. For instance, we 
examined reports for July 2010 in the National Vulnerability
Database². Among the relevant issues, we found
CVE-2009-4927, involving privilege escalation via a sur- 
prising setting of a specific cookie; and CVE-2010-2685 
and CVE-2009-4929, which allow administrative actions 
to be taken without proper credentials, via hand-crafted 
HTTP requests. UrFlow makes it easy to catch these 
problems, since it is not necessary to enumerate all pos- 
sible attack vectors, thanks to policies that talk directly 
about underlying resources. For instance, we introduced 
a bug in the Gradebook application to mimic the cookie 
bug above, where we allow anyone to set any student’s 
grade if a particular cookie is set to 1. The compiler com- 
plains that the database update policy may be violated, 
referencing the exact span of source code where the of- 
fending UPDATE statement occurs. The same output ap- 
pears if we simulate a forgotten access control check, in 
the style of the second two issues above, by commenting 
out an important if test.





UrFlow also requires no change to the runtime behav- 
ior of a program, and this baseline performance level 
is greater than for most popular Web languages and 
frameworks, thanks to the general-purpose and domain- 
specific optimizations performed by the Ur/Web com- 
piler. We present the performance of the UrFlow anal- 
ysis itself in Figure 2, for runs on a Linux machine with 
dual 1 GHz AMD64 processors with 2 GB of RAM. Of 
our case studies, only Forum takes much longer than a 
second to check. This is because Forum has a compli- 
cated main function, with many security checks. Many 
different actions call the main function after perform- 
ing some database modification. Every such call is an- 
alyzed afresh, as if the main function had been inlined. 
Techniques like procedure summaries [19] should make 
it possible to reduce this time significantly. 


² http://nvd.nist.gov/


USENIX Association 




Very precise, logic-based program analyses often ex- 
hibit bad scaling behavior. There is no theoretical rea- 
son that UrFlow would not run into the same problems. 
Many programs with exponentially many feasible paths 
will indeed trigger exponential behavior in any realiza- 
tion of our algorithm. Simple experiments with param- 
eterized families of programs also show that our current 
implementation produces exponential running time (with 
small constant factors) even on some examples that can 
probably be reduced to linear running time with more op- 
timization. For instance, we tested programs made up of 
if-trees that perform the same SQL query at each of the 
tree’s exponentially-many nodes. Primary key informa- 
tion implies that the if test always goes the same way, 
ruling out all but two paths through the tree. Still, expo- 
nential time usage results from our heuristic of consid- 
ering two execution paths starting at each query, for the 
cases of zero or more than zero result rows. Much future 
work remains in smarter detection of redundant paths. 


9 Related Work 


The BAN logic [2] is a formal system for reasoning about 
knowledge in distributed system protocols. The rules of 
the logic model important aspects like transitive trust and 
cryptography. The spi calculus [1] pursues similar goals, 
introducing an explicit formalization of programs, rather 
than just of the knowledge that principals have at points 
throughout a protocol. Our known predicate is modeled 
on notions introduced in that line of work. 

Security types [20] are a technique for static checking 
of information flow based on explicit data labels such 
as “high security” and “low security.” The JFlow [15] 
and Jif [16] systems are realistic implementations of se- 
curity typing for Java. SIF [5] extends Jif for the Web 
application domain. This line of work enables check- 
ing of a much broader range of applications than UrFlow 
can handle. By focusing on a narrow domain that nat- 
urally supports declarative implementation techniques, 
UrFlow is able to do sound checking without requir- 
ing any program annotations. Jif-based systems require 
many annotations, including explicit granting of trust 
to particular spans of code. The Swift system [4] ex- 
tends this approach to do automatic, secure partitioning 
of Web application code across client and server, based 
on information-flow constraints. 

Li and Zdancewic [14] presented a system for static 
checking of information-flow properties for database- 
backed Web applications. Their design requires that 
the application be programmed in terms of fixed sets of 
query templates with holes to be filled with different val- 
ues on different invocations. Every template is annotated 
with security typing information for each input and out- 
put. In contrast, UrFlow infers the security-relevant char- 


acteristics of queries from a declarative policy. One pol- 
icy may be enough to imply the sensitivity of outputs 
from many different query forms. UrFlow also applies 
theorem-proving technology to allow sound checking of 
more programs, including those where policies vary dy- 
namically based on database contents. 

Asbestos [9] and HiStar [23] are operating systems 
with support for dynamic enforcement of the Decentral- 
ized Information Flow Control model, which specifies 
which run-time flows between sensitive objects to allow. 
The Flume system [13] implements similar functionality 
on top of standard UNIX abstractions. All of these sys- 
tems can support complex system architectures that fall 
outside the specialized orientation of UrFlow. Flume has 
been used to build a secured version of the MoinMoin 
wiki application. This port to Flume required about 1000 
lines of new code and 1000 lines of modifications, and a 
performance cost between 34% and 43% was measured, 
against the baseline of interpreted Python code. Our Fo- 
rum case study demonstrates that UrFlow can check poli- 
cies based on access control lists, which are the main 
property enforced in the Flume case study. 

The Resin system [22] implements a much lighter- 
weight approach to Web application security. Instead of 
relying on a fixed label model, Resin allows program- 
mers to implement their own property checks in the lan- 
guage in which the application is written. Policy code 
may tag values with policy objects, and the Resin system 
takes care of flowing these policies through the system 
and checking them at points where the application inter- 
acts with its environment. Compared to the other systems 
we have mentioned, including UrFlow, Resin makes it 
much easier to add security checking to existing appli- 
cations written in popular scripting languages like PHP 
and Python. Resin’s lightweight policy approach can 
also express policies that UrFlow’s policy queries can- 
not. On the other hand, once a programmer has learned 
Ur/Web and used it to implement his application, UrFlow 
requires little annotation and brings the standard bene- 
fits of static analysis, compared to Resin and the systems 
mentioned in the previous paragraph: we get once-and- 
for-all security guarantees, without the possibility of the 
application being aborted because a problem is detected 
at run-time; and we avoid extra run-time costs, such as 
the 33% CPU overhead reported for a representative PHP 
application instrumented with Resin. 

Much work on Web application security focuses on 
injection attacks, where bugs allow untrusted user input 
to be passed to run-time program interpreters. Solutions 
have employed both static [12, 21] and dynamic [11, 17] 
analysis. Ur/Web rules out these problems by construc- 
tion, by encoding the syntax of HTML and SQL with 
richly-typed objects. 

Rizvi et al. [18] present a technique for fine-grained 
access control over SQL queries, based on the concept
of authorization views, which are much like UrFlow’s
policy queries. The key difference is that authorization
views are phrased in terms of variables like $user-id
that must be filled in by some out-of-band mechanism. 
With UrFlow, the correctness of authentication may itself 
be verified, through reasoning about the known predi- 
cate. The technique of Rizvi et al. is applied dynami- 
cally to individual queries, where an allowability check 
against the current database must be run for each query. 
In contrast, UrFlow can prove statically that an appli- 
cation never uses query results inappropriately, with no 
modification to run-time database operation. 

The SELinks system [7] extends the Links [6] Web 
programming language with support for static tracking 
of labels through trusted functions that enforce custom 
policies. The natural way of expressing some queries 
in SELinks involves mixing customized access control 
checks with code that should be compiled into SQL 
queries. The SELinks compiler handles the translation 
of the custom checks into stored procedures that the 
database engine can run during query evaluation. Ur- 
Flow follows the alternate approach of letting the pro- 
grammer be explicit about the interaction of checks and 
queries, such that the static analysis verifies that all this 
has been done correctly. In general, SELinks provides a 
type system that makes certain types of security proofs 
easier, though the SELinks compiler does not carry out 
those proofs itself. 


10 Conclusion 


We have presented UrFlow, a static program analysis 
that verifies adherence of database-backed Web applica- 
tions to security policies. These policies may vary by 
database state, and they are expressed as SQL queries, a 
convenient format for most Web programmers. UrFlow 
requires no program annotations and adds no run-time 
overhead. A key direction for future work is adaptation 
of UrFlow to more traditional languages, where database 
access is granted less of a first-class status, so that pro- 
gram analysis must be run to recover some information 
that UrFlow depends on. 
Abstract 


In this paper, we introduce accountable virtual ma- 
chines (AVMs). Like ordinary virtual machines, AVMs 
can execute binary software images in a virtualized copy 
of a computer system; in addition, they can record 
non-repudiable information that allows auditors to sub- 
sequently check whether the software behaved as in- 
tended. AVMs provide strong accountability, which is 
important, for instance, in distributed systems where dif- 
ferent hosts and organizations do not necessarily trust 
each other, or where software is hosted on third-party 
operated platforms. AVMs can provide accountability 
for unmodified binary images and do not require trusted 
hardware. To demonstrate that AVMs are practical, we 
have designed and implemented a prototype AVM mon- 
itor based on VMware Workstation, and used it to detect 
several existing cheats in Counterstrike, a popular online 
multi-player game. 


1 Introduction 


An accountable virtual machine (AVM) provides users 
with the capability to audit the execution of a software 
system by obtaining a log of the execution, and compar- 
ing it to a known-good execution. This capability is par- 
ticularly useful when users rely on software and services 
running on machines owned or operated by third par- 
ties. Auditing works for any binary image that executes 
inside the AVM and does not require that the user trust 
either the hardware or the accountable virtual machine 
monitor on which the image executes. Several classes of 
systems exemplify scenarios where AVMs are useful: 


• in a competitive system, such as an online game 
or an auction, users may wish to verify that other 
players do not cheat, and that the provider of the 
service implements the stated rules faithfully; 

• nodes in peer-to-peer and federated systems may 
wish to verify that others follow the protocol and 
contribute their fair share of resources; 

• cloud computing customers may wish to verify that 
the provider executes their code as intended. 
In these scenarios, software and hardware faults, misconfigurations, 
break-ins, and deliberate manipulation 
can lead to an abnormal execution, which can be costly 
to users and operators, and may be difficult to detect. 
When such a malfunction occurs, it is difficult to estab- 
lish who is responsible for the problem, and even more 
challenging to produce evidence that proves a party’s 
innocence or guilt. For example, in a cloud computing 
environment, failures can be caused both by bugs in the 
customer’s software and by faults or misconfiguration of 
the provider’s platform. If the failure was the result of a 
bug, the provider would like to be able to prove his own 
innocence, and if the provider was at fault, the customer 
would like to obtain proof of that fact. 

AVMs address these problems by providing users 
with the capability to detect faults, to identify the faulty 
node, and to produce evidence that connects the fault 
to the machine that caused it. These properties are 
achieved by running systems inside a virtual machine 
that 1) maintains a log with enough information to re- 
produce the entire execution of the system, and that 2) 
associates each outgoing message with a cryptographic 
record that links that action to the log of the execution 
that produced it. The log enables users to detect faults 
by replaying segments of the execution using a known- 
good copy of the system, and by cross-checking the ex- 
ternally visible behavior of that copy with the previously 
observed behavior. AVMs can provide this capability for 
any black-box binary image that can be run inside a VM. 

AVMs detect integrity violations of an execution 
without requiring the audited machine to run hardware 
or software components that are trusted by the auditor. 
When such trusted components are available, AVMs can 
be extended to detect some confidentiality violations as 
well, such as private data leaking out of the AVM. 

This paper makes three contributions: 1) it introduces 
the concept of AVMs, 2) it presents the design of an 
accountable virtual machine monitor (AVMM), and 3) 
it demonstrates that AVMs are practical for a specific 
application, namely the detection of cheating in multi- 
player games. Cheat detection is an interesting example 
application because it is a serious and well-understood 
problem for which AVMs are effective: they can detect 
a large and general class of cheats. Out of 26 existing 
cheats we downloaded from the Internet, AVMs can de- 
tect every single one—without prior knowledge of the 
cheat’s nature or implementation. 

We have built a prototype AVMM based on VMware 
Workstation, and used it to detect real cheats in Coun- 
terstrike, a popular multi-player game. Our evaluation 
shows that the costs of accountability in this context are 
moderate: the frame rate drops by 13%, from 158 fps on 
bare hardware to 137 fps on our prototype, the ping time 
increases by about 5 ms, and each player must store or 
transmit a log that grows by about 148 MB per hour af- 
ter compression. Most of this overhead is caused by log- 
ging the execution; the additional cost for accountabil- 
ity is comparatively small. The log can be transferred 
to other players and replayed there during the game (on- 
line) or after the game has finished (offline). 

While our evaluation in this paper focuses on games 
as an example application, AVMs are useful in other 
contexts, e.g., in p2p and federated systems, or to verify 
that a cloud platform is providing its services correctly 
and is allocating the promised resources [18]. Our pro- 
totype AVMM already supports techniques such as par- 
tial audits that would be useful for such applications, but 
a full evaluation is beyond the scope of this paper. 

The rest of this paper is structured as follows. Sec- 
tion 2 discusses related work, Section 3 explains the 
AVM approach, and Section 4 presents the design of our 
prototype AVMM. Sections 5 and 6 describe our imple- 
mentation and report evaluation results in the context of 
games. Section 7 describes other applications and pos- 
sible extensions, and Section 8 concludes this paper. 


2 Related work 


Deterministic replay: Our prototype AVMM relies on 
the ability to replay the execution of a virtual machine. 
Replay techniques have been studied for more than two 
decades, usually in the context of debugging, and ma- 
ture solutions are available [6, 15, 16, 39]. However, 
replay by itself is not sufficient to detect faults on a re- 
mote machine, since the machine could record incorrect 
information in such a way that the replay looks correct, 
or provide inconsistent information to different auditors. 
Improving the efficiency of replay is an active re- 
search area. Remus [11] contributes a highly efficient 
snapshotting mechanism, and many current efforts seek 
to improve the efficiency of logging and replay for 
multi-core systems [13, 16, 28, 29]. AVMMs can di- 
rectly benefit from these innovations. 

Accountability: Accountability in distributed systems 
has been suggested as a means to achieve practical se- 
curity [26], to create an incentive for cooperative be- 
havior [14], to foster innovation and competition in the 
Internet [4, 27], and even as a general design goal for 
dependable networked systems [43]. Several prior systems 
provide accountability for specific applications, including 
network storage services [44], peer-to-peer content 
distribution networks [31], and interdomain routing 
[2, 20]. Unlike these systems, AVMs are application 
independent. PeerReview [21] provides accountability 
for general distributed systems. However, PeerReview 
must be closely integrated with the application, which 
requires source code modifications and a detailed under- 
standing of the application logic. It would be impracti- 
cal to apply PeerReview to an entire VM image with 
dozens of applications and without access to the source 
code of each. AVMs do not have these limitations; they 
can make software accountable ‘out of the box’. 

Remote fault detection: GridCop [42] is a compiler-based 
technique that can be used to monitor the progress 
and execution of a remotely executing program by in- 
specting periodic beacon packets. GridCop is designed 
for a less hostile environment than AVMs: it assumes a 
trusted platform and self-interested hosts. Also, Grid- 
Cop does not work for unmodified binaries, and it can- 
not produce evidence that would convince a third party 
that a fault did or did not happen. 

A trusted computing platform can be used to detect if 
a node is running modified software [17, 30]. The ap- 
proach requires trusted hardware, a trusted OS kernel, 
and a software and hardware certification infrastructure. 
Pioneer [36] can detect such modifications using only 
software, but it relies on recognizing sub-millisecond 
delay variations, which restricts its use to small net- 
works. AVMs do not require any trusted hardware and 
can be used in wide-area networks. 

Cheat detection: Cheating in online games is an important 
problem that affects game players and game operators 
alike [24]. Several cheat detection techniques are 
available, such as scanning for known hacks [23, 35] or 
defenses against specific forms of cheating [7, 32]. In 
contrast to these, AVMs are generic; that is, they are ef- 
fective against an entire class of cheats. Chambers et 
al. [9] describe another technique to detect if players 
lie about their game state. The system relies on a form 
of tamper-evident logs; however, the log must be inte- 
grated with the game, while AVMs work for unmodified 
games. 


3 Accountable Virtual Machines 
3.1 Goals 


Figure 1 depicts the basic scenario we are concerned 
with in this paper. Alice is relying on Bob to run some 
software S on a machine M, which is under Bob’s control. 
However, Alice cannot observe M directly; she can 
only communicate with it over the network. Our goal 
is to enable Alice to check whether M behaves as expected, 
without having to trust Bob, M, or any software 
running on M. 


[Figure 1 labels: Software S; Alice, Network, Machine M, Bob] 

Figure 1: Basic scenario. Alice is relying on software 
S, which is running on a machine that is under Bob’s 
control. Alice would like to verify that the machine is 
working properly, and that Bob has not modified S. 

To define the behavior Alice expects M to have, we 
assume that Alice has some reference implementation of 
M called M_R, which runs S. We say that M is correct 
iff M_R can produce the same network output as M when 
it is started in the same initial state and given precisely 
the same network inputs. If M is not correct, we say 
that it is faulty. This can happen if M differs from M_R, 
or Bob has installed software other than S. Our goal is 
to provide the following properties: 


• Detection: If M is faulty, Alice can detect this. 


• Evidence: When Alice detects a fault on M, she 
can obtain evidence that would convince a third 
party that M is faulty, without requiring that this 
party trust Alice or Bob. 


We are particularly interested in solutions that work for 
any software S that can execute on M and M_R. For 
example, S could be a program binary that was com- 
piled by someone other than Alice, it could be a complex 
application whose details neither Alice nor Bob under- 
stand, or it could be an entire operating system image 
running a commodity OS like Linux or Windows. 

In the rest of this paper, we will omit explicit references 
to S when it is clear from the context which software 
M is expected to run. 


3.2 Approach 


To detect faults on M, Alice must be able to answer 
two questions: 1) which exact sequence of network messages 
did M send and receive, and 2) is there a correct 
execution of M_R that is consistent with this sequence of 
messages? Answering the former is not trivial because 
a faulty M, or a malicious Bob, could try to falsify 
the answer. Answering the latter is difficult because the 
number of possible executions for any nontrivial software 
is large. 

Alice can solve this problem by combining two seem- 
ingly unrelated technologies: tamper-evident logs and 
virtual machines. A tamper-evident log [21] requires 
each node to record all the messages it has sent or received. 
Whenever a message is transmitted, the sender 
and the receiver must prove to each other that they have 
added the message to their logs, and they must commit 
to the contents of their logs by exchanging an authenti- 
cator — essentially, a signed hash of the log. The authen- 
ticators provide nonrepudiation, and they can be used to 
detect when a node tampers with its log, e.g., by forging, 
omitting, or modifying messages, or by forking the log. 

Once Alice has determined that M’s message log is 
genuine, she must either find a correct execution of M_R 
that matches this log, or establish that there isn’t one. To 
help Alice with this task, M can be required to record 
additional information about nondeterministic events in 
the execution of S. Given this information, Alice can 
use deterministic replay [8, 15] to find the correct execution 
on M_R, provided that one exists. 

Recording the relevant nondeterministic events seems 
difficult at first because we have assumed that neither 
Alice nor Bob have the expertise to make modifications 
to S; however, Bob can avoid this by using a virtual 
machine monitor (VMM) to monitor the execution of S 
and to capture inputs and nondeterministic events in a 
generic, application-independent way. 


3.3 AVM monitors 


The above building blocks can be combined to con- 
struct an accountable virtual machine monitor (AVMM), 
which implements AVMs. Alice and Bob can use an 
AVMM to achieve the goals from Section 3.1 as follows: 


1. Bob installs an AVMM on his computer and runs 
the software S inside an AVM. (From this point 
forward, M refers to the entire stack consisting 
of Bob’s computer, the AVMM running on Bob’s 
computer, and Alice’s virtual machine image S, 
which runs on the AVMM.) 


2. The AVMM maintains a tamper-evident log of the 
messages M sends or receives, and it also records 
any nondeterministic events that affect S. 


3. When Alice receives a message from M, she de- 
taches the authenticator and saves it for later. 


4. Alice periodically audits M as follows: she asks 
the AVMM for its log, verifies it against the au- 
thenticators she has collected, and then uses deter- 
ministic replay to check the log for faults. 


5. If replay fails or the log cannot be verified against 
one of the authenticators, Alice can give M_R, S, 
the log, and the authenticators to a third party, who 
can repeat Alice’s checks and independently verify 
that a fault has occurred. 


This generic methodology meets our previously stated 
goals: Alice can detect faults on M, she can obtain evi- 
dence, and a third party can check the evidence without 
having to trust either Alice or Bob. 







(a) Symmetric multi-party scenario (online game) 
(b) Asymmetric multi-party scenario (web service) 


Figure 2: Multi-party scenarios. The scenario on the left represents a multi-player game; each player is running the 
game client on his local machine and wants to know whether any other players are cheating. The scenario on the right 
represents a hosted web service: Alice’s software is running on Bob’s machine, but the software typically interacts 
with users other than Alice, such as Alice’s customers. 


3.4 Does the AVMM have to be trusted? 


A perhaps surprising consequence of this approach is 
that the AVMM does not have to be trusted by Alice. 
Suppose Bob is malicious and secretly tampers with Alice’s 
software and/or the AVMM, causing M to become 
faulty. Bob cannot prevent Alice from detecting this: if 
he tampers with M’s log, Alice can tell because the log 
will not match the authenticators; if he does not, Alice 
obtains the exact sequence of observable messages M 
has sent and received, and since by our definition of a 
fault there is no correct execution of M_R that is consistent 
with this sequence, deterministic replay inevitably 
fails, no matter what the AVMM recorded. 


3.5 Must Alice check the entire log? 


For many applications, including the game we consider 
in this paper, it is perfectly feasible for Alice to audit 
M’s entire log. However, for long-running, compute- 
intensive applications, Alice may want to save time by 
doing spot checks on a few log segments instead. The 
AVMM can enable her to do this by periodically taking 
a snapshot of the AVM’s state. Thus, Alice can inde- 
pendently inspect any segment that begins and ends at a 
snapshot. 

Spot checking sacrifices the completeness of fault de- 
tection for efficiency. If Alice chooses to do spot checks, 
she can only detect faults that manifest as incorrect state 
transitions in the segments she inspects. An incorrect 
state transition in an unchecked segment, on the other 
hand, could permanently modify M’s state in a way 
that is not detectable by checking subsequent segments. 
Therefore, Alice must be careful when choosing an ap- 
propriate policy. 

Alice could inspect a random sample of segments plus 
any segments in which a fault could most likely have a 
long-term effect on the AVM’s state (e.g., during initialization, 
authentication, key generation). Or, she could 
inspect segments when she observes suspicious results, 
starting with the most recent segment and working back- 
wards in reverse chronological order. Spot-checking is 
most effective in applications where the faults of interest 
likely occur repeatedly and a single instance causes lim- 
ited harm, where the application state is frequently re- 
initialized (preventing long-term effects of a single un- 
detected fault on the state), or where the threat of prob- 
abilistic detection is strong enough to deter attackers. 
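
The deterrence argument can be illustrated with a small calculation. The sketch below rests on assumptions that are not part of the AVMM design: the log is split into n segments, the fault manifests in k of them, and Alice inspects m segments chosen uniformly at random without replacement. The function name is hypothetical.

```python
from math import comb


def detection_probability(n: int, k: int, m: int) -> float:
    """Chance that a uniform random sample of m out of n segments
    contains at least one of the k segments in which the fault shows.

    Computed as 1 - C(n-k, m) / C(n, m), i.e. one minus the chance
    that every sampled segment is fault-free.
    """
    if not (0 <= k <= n and 0 <= m <= n):
        raise ValueError("need 0 <= k <= n and 0 <= m <= n")
    return 1.0 - comb(n - k, m) / comb(n, m)
```

For instance, `detection_probability(100, 5, 20)` estimates the chance of catching a cheat that is active in 5 of 100 segments when 20 segments are audited. The estimate covers only faults that are visible in the inspected segments; it says nothing about state changes that persist from an unchecked segment.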


3.6 Do AVMs work with multiple parties? 


So far, we have focused on a simple two-party scenario; 
however, AVMs can be used in more complex scenar- 
ios. Figure 2 shows two examples. In the scenario on 
the left, the players in an online multi-player game are 
using AVMs to detect whether someone is cheating. Un- 
like the basic scenario in Figure 1, this scenario is sym- 
metric in the sense that each player is both running soft- 
ware and is interested in the correctness of the software 
on all the other machines. Thus, the roles of auditor 
and auditee can be played by different parties at differ- 
ent times. The scenario on the right represents a hosted 
web service: the software is controlled and audited by 
Alice, but the software typically interacts with parties 
other than Alice, such as Alice’s customers. 


For clarity, we will explain our system mostly in 
terms of the simple two-party scenario in Figure 1. In 
Section 4.6, we will describe differences for the multi- 
party case. 


4 AVMM design 


To demonstrate that AVMs are practical, we now present 
the design of a specific AVMM. 
4.1 Assumptions 


Our design relies on the following assumptions: 


1. All transmitted messages are eventually received, 
if retransmitted sufficiently often. 


2. All parties (machines and users) have access to a 
hash function that is pre-image resistant, second 
pre-image resistant, and collision resistant. 


3. Each party has a certified keypair, which can be 
used to sign messages. Neither signatures nor cer- 
tificates can be forged. 


4. If a user needs to audit the log of a machine, the 
user has access to a reference copy of the VM im- 
age that the machine is expected to use. 


The first two are common assumptions made about prac- 
tical distributed systems. In particular, the first assump- 
tion is required for liveness, otherwise it could be im- 
possible to ever complete an audit. The third assump- 
tion could be satisfied by providing each machine with a 
keypair that is signed by the administrator; it is needed 
to prevent faulty machines from creating fake identities. 
The fourth assumption is required so that the auditor 
knows which behaviors are correct. 


4.2 Roadmap 


Our design instantiates each of the building blocks we 
have described in Section 3.2: a VMM, a tamper-evident 
log, and an auditing mechanism. Here, we give a brief 
overview; the rest of this section describes each building 
block in more detail. 

For the tamper-evident log (Section 4.3), we adapt a 
technique from PeerReview [21], which already comes 
with a proof of correctness [22]. We extend this log to 
also include the VMM’s execution trace. 

The VMM we use in this design (Section 4.4) virtual- 
izes a standard commodity PC. This platform is attrac- 
tive because of the vast amount of existing software that 
can run on it; however, for historical reasons, it is harder 
to virtualize than a more modern platform such as Java 
or .NET. In addition, interactions between the software 
and the virtual ‘hardware’ are much more frequent than, 
e.g., in Java, resulting in a potentially higher overhead. 

For auditing (Section 4.5), we provide a tool that au- 
thenticates the log, then checks it for tampering, and 
finally uses deterministic replay to determine whether 
the contents of the log correspond to a correct execution 
of M_R. If the tool finds any discrepancy between 
the events in the log and the events that occur during 
replay, this indicates a fault. Note that, while events 
such as thread scheduling may appear nondeterminis- 
tic to an application, they are in fact deterministic from 
the VMM’s perspective. Therefore, as long as all external 
events (e.g., timer interrupts) are recorded in the 
log, even race conditions are reproduced exactly during 
replay and cannot result in false positives.¹ 


4.3 Tamper-evident log 


The tamper-evident log is structured as a hash chain; 
each log entry is of the form e_i := (s_i, t_i, c_i, h_i), where 
s_i is a monotonically increasing sequence number, t_i 
a type, and c_i data of the specified type. h_i is a hash 
value that must be linked to all the previous entries in 
the log, and yet efficient to create. Hence, we compute 
it as h_i = H(h_{i-1} || s_i || t_i || H(c_i)), where h_0 := 0, H 
is a hash function, and || stands for concatenation. 
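
The hash chain can be sketched in a few lines of Python. This is a minimal illustration, not the prototype's code: SHA-256 as H, the byte encoding of s_i and t_i, and all names (`Entry`, `append`, `encode`) are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


def H(data: bytes) -> bytes:
    """Hash function H; SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(data).digest()


def encode(seq: int, entry_type: str) -> bytes:
    # Fixed-width sequence number and length-prefixed type keep the
    # concatenation unambiguous (an encoding chosen for this sketch).
    t = entry_type.encode()
    return seq.to_bytes(8, "big") + len(t).to_bytes(2, "big") + t


@dataclass(frozen=True)
class Entry:
    seq: int        # s_i
    type: str       # t_i
    content: bytes  # c_i
    hash: bytes     # h_i


H0 = bytes(32)      # h_0 := 0


def append(log: list, entry_type: str, content: bytes) -> Entry:
    """Append e_i with h_i = H(h_{i-1} || s_i || t_i || H(c_i))."""
    prev = log[-1].hash if log else H0
    seq = log[-1].seq + 1 if log else 1
    entry = Entry(seq, entry_type, content,
                  H(prev + encode(seq, entry_type) + H(content)))
    log.append(entry)
    return entry


log = []
append(log, "SEND", b"hello")
append(log, "RECV", b"world")
```

Because each h_i covers h_{i-1}, changing any earlier entry changes every later hash. Hashing the content separately as H(c_i) means a hash can be recomputed from H(c_i) alone, without the full content.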

To detect when Bob’s machine M forges incoming 
messages, Alice signs each of her messages with her 
own private key. The AVMM logs the signatures to- 
gether with the messages, so that they can be verified 
during an audit, but it removes them before passing the 
messages on to the AVM. Thus, this process is transpar- 
ent to the software running inside the AVM. 


To ensure nonrepudiation, the AVMM attaches an 
authenticator to each outgoing message m. The authenticator 
for an entry e_i is a_i := (s_i, h_i, σ(s_i || h_i)), 
where the σ(·) operator denotes a cryptographic signature 
with the machine’s private key. M also includes 
h_{i-1}, so that Alice can recalculate h_i = 
H(h_{i-1} || s_i || SEND || H(m)) and thus verify that the 
entry e_i is in fact SEND(m). 
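
A sketch of how an authenticator might be created and checked follows. One substitution must be stated plainly: the Python standard library has no public-key signatures, so the example uses an HMAC as a stand-in for σ. An HMAC is symmetric and therefore does not provide nonrepudiation; a real AVMM would sign with the machine's private key. The encoding and all function names are hypothetical.

```python
import hashlib
import hmac


def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def enc(seq: int) -> bytes:
    return seq.to_bytes(8, "big")


def sign(key: bytes, data: bytes) -> bytes:
    # Stand-in for sigma(.): HMAC is symmetric, so unlike a real
    # public-key signature it does NOT give nonrepudiation.
    return hmac.new(key, data, hashlib.sha256).digest()


def make_authenticator(key: bytes, seq: int, prev_hash: bytes, message: bytes):
    """Sender side: h_i for SEND(m), then a_i = (s_i, h_i, sigma(s_i || h_i))."""
    h = H(prev_hash + enc(seq) + b"SEND" + H(message))
    return (seq, h, sign(key, enc(seq) + h))


def check_authenticator(key: bytes, auth, prev_hash: bytes, message: bytes) -> bool:
    """Receiver side: check the signature, then recompute h_i from
    h_{i-1}, s_i and H(m) to confirm that entry e_i is SEND(m)."""
    seq, h, sig = auth
    if not hmac.compare_digest(sig, sign(key, enc(seq) + h)):
        return False
    return h == H(prev_hash + enc(seq) + b"SEND" + H(message))


key = b"demo key (hypothetical)"
prev = bytes(32)                      # h_{i-1}; here h_0
auth = make_authenticator(key, 1, prev, b"move north")
```

Alice keeps `auth` after the check succeeds. It later commits M to a log whose entry number 1 hashes to the value inside the authenticator.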





To detect when M drops incoming or outgoing messages, 
both Alice and the AVMM send an acknowledgment 
for each message m they receive. Analogous to the 
above, M’s authenticator in the acknowledgment contains 
enough information for the recipient to verify that 
the corresponding entry is RECV(m). Alice’s own acknowledgment 
contains just a signed hash of the corresponding 
message, which the AVMM logs for Alice. 
When an acknowledgment is not received, the original 
message is retransmitted a few times. If Alice stops receiving 
messages from M altogether, she can only suspect 
that M has failed. 


When Alice wants to audit M, she retrieves a pair of 
authenticators (e.g., the ones with the lowest and highest 
sequence numbers) and challenges M to produce the log 
segment that connects them. She then verifies that the 
hash chain is intact. Because the hash function is second 
pre-image resistant, it is computationally infeasible to 
modify the log without breaking the hash chain. Thus, 
if M has reordered or tampered with a log entry in that 
segment, or if it has forked its log, M’s hash chain will 
no longer match its previously issued authenticators, and 
Alice can detect this using this check. 
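
The check described above can be sketched as follows. The example assumes SHA-256 and a simple tuple layout for log entries, and it represents each authenticator only by its (sequence number, hash) pair; signature verification is omitted. All names are hypothetical.

```python
import hashlib


def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def chain_hash(prev: bytes, seq: int, typ: str, content: bytes) -> bytes:
    t = typ.encode()
    return H(prev + seq.to_bytes(8, "big") + len(t).to_bytes(2, "big")
             + t + H(content))


def build_log(items):
    """Return [(seq, type, content, hash)], chained from h_0 = 0."""
    log, prev = [], bytes(32)
    for seq, (typ, content) in enumerate(items, start=1):
        prev = chain_hash(prev, seq, typ, content)
        log.append((seq, typ, content, prev))
    return log


def verify_segment(segment, start_hash: bytes, authenticators) -> bool:
    """Recompute the chain over `segment` and compare it with every
    (seq, hash) pair that Alice collected earlier."""
    known = dict(authenticators)
    prev, expected_seq = start_hash, segment[0][0]
    for seq, typ, content, h in segment:
        if seq != expected_seq or chain_hash(prev, seq, typ, content) != h:
            return False    # reordered or modified entry, or broken chain
        if seq in known and known[seq] != h:
            return False    # log contradicts an issued authenticator
        prev, expected_seq = h, seq + 1
    return True
```

A machine that rewrites an entry and recomputes all later hashes still fails the check, because the recomputed hashes no longer match the authenticators it issued before the rewrite.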


¹Ensuring deterministic replay on multiprocessor machines is 
more difficult. We will discuss this in Section 7.4. 
4.4 Virtual machine monitor 


In addition to recording all incoming and outgoing mes- 
sages to the tamper-evident log, the AVMM logs enough 
information about the execution of the software to en- 
able deterministic replay. 

Recording nondeterministic inputs: The AVMM must 
record all of the AVM’s nondeterministic inputs [8]. If 
an input is asynchronous, the precise timing within the 
execution must be recorded, so that the input can be re- 
injected at the exact same point during replay. Hardware 
interrupts, for example, fall into this category. Note that 
wall-clock time is not sufficiently precise to describe the 
timing of such inputs, since the instruction timing can 
vary on most modern CPUs. Instead, the AVMM uses a 
combination of instruction pointer, branch counter, and, 
where necessary, additional registers [15]. 
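
The role of the branch counter can be illustrated with a toy model. The sketch below is not how a real VMM works: the "machine" is a simple deterministic state update, its step counter stands in for the hardware branch counter, and the instruction pointer and registers are left out. All names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMachine:
    """Deterministic toy guest whose state advances once per 'branch'."""
    state: int = 1
    branches: int = 0          # stands in for the hardware branch counter
    outputs: list = field(default_factory=list)

    def step(self) -> None:
        self.state = (self.state * 31 + 7) % 1009
        self.branches += 1
        if self.branches % 4 == 0:
            self.outputs.append(self.state)   # externally visible output

    def interrupt(self, value: int) -> None:
        self.state ^= value    # an asynchronous input changes the state


def record(arrivals: dict, steps: int):
    """Live run: `arrivals` maps a branch count to an interrupt value.
    Returns the outputs and the log of (branch_count, value) events."""
    machine, log = ToyMachine(), []
    for _ in range(steps):
        if machine.branches in arrivals:
            log.append((machine.branches, arrivals[machine.branches]))
            machine.interrupt(arrivals[machine.branches])
        machine.step()
    return machine.outputs, log


def replay(log: list, steps: int) -> list:
    """Re-inject each logged event at exactly the recorded branch count."""
    machine, pending = ToyMachine(), dict(log)
    for _ in range(steps):
        if machine.branches in pending:
            machine.interrupt(pending[machine.branches])
        machine.step()
    return machine.outputs
```

Replaying the recorded log reproduces the original outputs. Delivering the same interrupts one branch later already yields different outputs in this toy, which is why the position in the execution, rather than wall-clock time, is what must be logged.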

Not all inputs are nondeterministic. For example, the 
values returned by accesses to the AVM’s virtual hard- 
disk need not be recorded. Alice knows the system im- 
age that the machine is expected to use, and can thus 
reconstruct the correct inputs during replay. Also many 
inputs such as software interrupts are synchronous, that 
is, they are explicitly requested by the AVM. Here, the 
timing need not be recorded because the requests will be 
issued again during replay. 

Detecting inconsistencies: The tamper-evident log now 
contains two parallel streams of information: Message 
exchanges and nondeterministic inputs. Incoming mes- 
sages appear in both streams: first as messages, and 
then, as the AVM reads the bytes in the message, as a 
sequence of inputs. If Bob is malicious, he might try to 
exploit this by forging messages or by dropping or mod- 
ifying a message that was received on M before it is 
injected into the AVM. To detect this, the AVMM cross- 
references messages and inputs in such a way that any 
discrepancies can easily be detected during replay. 

Snapshots: To enable spot checking and incremental 
audits (Section 3.5), the AVMM periodically takes a 
snapshot of the AVM’s current state. To save space, 
snapshots are incremental, that is, they only contain 
the state that has changed since the last snapshot. The 
AVMM also maintains a hash tree over the state; af- 
ter each snapshot, it updates the tree and then records 
the top-level value in the log. When Alice audits a log 
segment, she can either download an entire snapshot or 
incrementally request the parts of the state that are ac- 
cessed during replay. In either case, she can use the hash 
tree to authenticate the state she has downloaded. 
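
A hash tree over the state can be sketched as a binary Merkle tree over fixed-size pages. This is an illustration under assumptions of our choosing (SHA-256, leaf and node prefixes, duplicating the last node on odd levels); the paper does not specify the tree layout, and all names are hypothetical.

```python
import hashlib


def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(pages):
    """Return the list of levels, leaves first. On a level with an odd
    number of nodes, the last node is paired with itself."""
    level = [H(b"leaf" + p) for p in pages]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]
        level = [H(b"node" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def proof(levels, index: int):
    """Sibling hashes needed to recompute the root from one page."""
    path = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        path.append((level[sibling], sibling < index))
        index //= 2
    return path


def verify_page(page: bytes, path, root: bytes) -> bool:
    """Check a downloaded page against the top-level value from the log."""
    h = H(b"leaf" + page)
    for sibling, sibling_is_left in path:
        h = H(b"node" + (sibling + h if sibling_is_left else h + sibling))
    return h == root


pages = [b"page-%d" % i for i in range(5)]
levels = build_tree(pages)
root = levels[-1][0]            # the value recorded in the log
```

With such a tree, Alice can request only the pages that replay touches and authenticate each one with a logarithmic number of sibling hashes, instead of downloading the whole snapshot.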

Taking frequent snapshots enables Alice to perform 
fine-grain audits, but it also increases the overhead. 
However, snapshotting techniques have become very ef- 
ficient; recent work on VM replication has shown that 
incremental snapshots can be taken up to 40 times per 
second [11] and with only brief interruptions of the VM, 
on the order of a few milliseconds. Accountability requires 
only infrequent snapshots (once every few minutes 
or hours), so the overhead should be low. 


4.5 Auditing and replay 


When Alice wants to audit a machine M, she performs 
the following three steps. First, Alice obtains a segment 
of M’s log and the authenticators that M produced during 
the execution, so that the log’s integrity can be verified. 
Second, she downloads a snapshot of the AVM 
at the beginning of the segment. Finally, she replays 
the entire segment, starting from the snapshot, to check 
whether the events in the log correspond to a correct ex- 
ecution of the reference software. 

Verifying the log: When Alice wants to audit a log 
segment e_i ... e_j, she retrieves the authenticators she 
has received from M with sequence numbers in [s_i, s_j]. 
Next, Alice downloads the corresponding log segment 
L_ij from M, starting with the most recent snapshot before 
e_i and ending at e_j; then she verifies the segment 
against the authenticators to check for tampering. If this 
step succeeds, Alice is convinced that the log segment 
is genuine; thus, she is left with having to establish that 
the execution described by the segment is correct. 

If M is faulty, Alice may not be able to download 
L_ij at all, or M could return a corrupted log segment 
that causes verification to fail. In either case, Alice can 
use the most recent authenticator a_j as evidence to convince 
a third party of the fault. Since the authenticator 
is signed, the third party can use a_j to verify that log 
entries with sequence numbers up to s_j must exist; then 
it can repeat Alice’s audit. If no reply is obtained, Alice 
will suspect Bob. 

Verifying the snapshot: Next, Alice must obtain a 
snapshot of the AVM’s state at the beginning of the log 
segment L_ij. If Alice is auditing the entire execution, 
she can simply use the original software image S. Otherwise 
she downloads a snapshot from M and recomputes 
the hash tree to authenticate it against the hash 
value in L_ij. 

Verifying the execution: For the final step, Alice needs 
three inputs: the log segment L_ij, the VM snapshot, 
and the public keys of M and any users who communicated 
with M. The audit tool performs two checks on 
L_ij, a syntactic check and a semantic check. The syntactic 
check determines whether the log itself is well-formed, 
whereas the semantic check determines whether 
the information in the log corresponds to a correct execution 
of M_R. 

For the syntactic check, the audit tool checks whether 
all log entries have the proper format, verifies the cryp- 
tographic signatures in each message and acknowledg- 
ment, checks whether each message was acknowledged, 
and checks whether the sequence of sent and received 
messages corresponds to the sequence of messages that 
enter and exit the AVM. If any of these tests fails, the tool 
reports a fault. 
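
A simplified version of the syntactic check might look as follows. The entry kinds, the `Entry` layout, and the function name are hypothetical. The sketch also omits signature verification and checks acknowledgments only for sent messages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    kind: str              # one of KINDS below (hypothetical names)
    msg_id: int            # message the entry refers to
    payload: bytes = b""


KINDS = {"SEND", "RECV", "ACK", "VM_IN", "VM_OUT"}


def syntactic_check(log) -> list:
    """Return the problems found; an empty list means the check passed."""
    problems = []
    for i, e in enumerate(log):
        if e.kind not in KINDS:
            problems.append(f"entry {i}: unknown kind {e.kind!r}")

    def stream(kind):
        return [(e.msg_id, e.payload) for e in log if e.kind == kind]

    acked = {e.msg_id for e in log if e.kind == "ACK"}
    for msg_id, _ in stream("SEND"):
        if msg_id not in acked:
            problems.append(f"message {msg_id} was never acknowledged")
    # Messages on the wire must match what entered and left the AVM.
    if stream("RECV") != stream("VM_IN"):
        problems.append("received messages differ from the AVM's inputs")
    if stream("SEND") != stream("VM_OUT"):
        problems.append("sent messages differ from the AVM's outputs")
    return problems
```

The two stream comparisons correspond to the cross-referencing of messages and inputs: a message that is dropped or modified between the network and the AVM makes the two sequences diverge.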


For the semantic check, the tool locally instantiates a
virtual machine that implements M_R, and it initializes
the machine with the snapshot, if any, or S. Next, it
reads L_ij from beginning to end, replaying the inputs,
checking the outputs against the outputs in L_ij, and verifying
any snapshot hashes in L_ij against snapshots of
the replayed execution (to be sure that the snapshot at
the end of L_ij is also correct). If there is any discrepancy
whatsoever (for example, if the virtual machine
produces outputs that are not in the log, or if it requests
the synchronous inputs in a different order), replay terminates
and reports a fault. In this case, Alice can use
L_ij and the authenticators as evidence to convince Bob,
or any other interested party, that M is faulty.
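
The semantic check can be sketched as a replay loop over a simplified log. The log format (a list of `("input", x)` and `("output", y)` entries), the `machine_step` callback, and the toy `adder` machine are hypothetical stand-ins for a deterministic reference machine; the real AVMM replays a full virtual machine.

```python
def replay(machine_step, initial_state, log):
    """Replay logged inputs and compare produced outputs with logged ones.

    Returns (True, None) on success, or (False, index) where index is the
    position of the first log entry at which replay diverges.
    """
    state = initial_state
    pending = []  # outputs produced by the replayed machine, not yet matched
    for index, (kind, payload) in enumerate(log):
        if kind == "input":
            if pending:
                return False, index  # machine produced an output not in the log
            state, outputs = machine_step(state, payload)
            pending.extend(outputs)
        elif kind == "output":
            if not pending or pending.pop(0) != payload:
                return False, index  # logged output does not match replay
        else:
            return False, index  # malformed entry
    if pending:
        return False, len(log)  # trailing outputs missing from the log
    return True, None


def adder(state, value):
    """Toy deterministic machine: keeps a running sum and emits it."""
    new_state = state + value
    return new_state, [new_state]
```

With `adder` and an initial state of 0, the log `[("input", 2), ("output", 2), ("input", 3), ("output", 5)]` replays successfully, whereas changing the last entry to `("output", 7)` makes the replay report a divergence at index 3.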


If the log segment L_ij passes all of the above checks,
the tool reports success and then terminates. Auditing 
can be performed offline (after the execution of a given 
log segment is finished) or online (while the execution 
is in progress). 


4.6 Multi-party scenario


So far, we have described the AVMM in terms of the 
simple two-party scenario. A multi-party scenario re- 
quires three changes. First, when some user wants to 
audit a machine M, he needs to collect authenticators
from other users that may have communicated with M. 
In the gaming scenario in Figure 2(a), Alice could down- 
load authenticators from Charlie before auditing Bob. In 
the web-service scenario in Figure 2(b), the users could 
forward any authenticators they receive to Alice. 


Second, with more than two parties, network prob- 
lems could make the same node appear unresponsive to 
some nodes and alive to others. Bob could exploit this, 
for instance, to avoid responding to Alice’s request for 
an incriminating log segment, while continuing to work 
with other nodes. To prevent this type of attack, Al- 
ice forwards the message that M does not answer as a 
challenge for M to the other nodes. All nodes stop com- 
municating with M until it responds to the challenge. If 
M is correct but there is a network problem between M
and Alice, or M was temporarily unresponsive, it can 
answer the challenge and its response is forwarded to 
Alice. 
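
The challenge mechanism can be sketched from the point of view of a single node. The `Node` class and its method names are hypothetical, and the sketch deliberately omits checking that a response is valid; it only shows the rule that a node stops communicating with M while a forwarded challenge is unanswered.

```python
class Node:
    """Tracks which peers currently have unanswered challenges."""

    def __init__(self, name):
        self.name = name
        self.open_challenges = {}  # peer name -> set of challenge ids

    def receive_challenge(self, suspect, challenge_id):
        """Another node forwarded a message that `suspect` did not answer."""
        self.open_challenges.setdefault(suspect, set()).add(challenge_id)

    def receive_response(self, suspect, challenge_id):
        """`suspect` answered; the response would be forwarded to the challenger."""
        self.open_challenges.get(suspect, set()).discard(challenge_id)

    def may_communicate_with(self, peer):
        """A node talks to a peer only if no challenge against it is open."""
        return not self.open_challenges.get(peer)
```

A correct but temporarily unreachable M can clear the challenge by answering it, after which the other nodes resume communication.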


Third, when one user obtains evidence of a fault, he 
may need to distribute that evidence to other interested 
parties. For example, in the gaming scenario, if Alice 
detects that Bob is cheating, she can send the evidence 
to Charlie, who can verify it independently; then both 
can decide never to play with Bob again. 


4.7 Guarantees 


Given our assumptions from Section 4.1 and the fault 
definition from Section 3.1, the AVMM offers the fol- 
lowing two guarantees: 


• Completeness: If the machine M is faulty, a full
audit of M will report a fault and produce evidence
against M that can be verified by a third party.

• Accuracy: If the machine M is not faulty, no audit
of M will report a fault, and there cannot exist any
valid evidence against M.


If Alice performs spot checks on a number of log segments
s_1, ..., s_k rather than a full audit, accuracy still
holds. However, if M is faulty, her audit will only report
the fault and produce evidence if there exists at least
one log segment s_i in which the fault manifests. These
guarantees are independent of the software S, and they
hold for any fault that manifests as a deviation from M_R,
even if Alice, Bob, and/or other users are malicious. A 
proof of these properties is presented in a separate tech- 
nical report [19]. 

Since our design is based on the tamper-evident log 
from PeerReview [21], the resulting AVMM inherits a 
powerful property from PeerReview: in a distributed 
system with multiple nodes, it is possible to audit the 
execution of the entire system by auditing each node in- 
dividually. For more details, please refer to [21]. 


4.8 Limitations 


We note two limitations implied by the AVMM’s guarantees.
First, AVMs cannot detect bugs or vulnerabilities
in the software S, because the expected behavior of
M is defined by M_R and thus S. If S has a bug and the
bug is exercised during an execution, an audit will still
succeed. For instance, if S allows unauthorized software
modifications, Bob could use this feature to change or
replace S. Alice must therefore make sure that S does
not have vulnerabilities that Bob could exploit.

Second, any behavior that can be achieved by providing
appropriate inputs to M_R is considered correct.
When such inputs come from sources other than the network,
they cannot be verified during an audit. In some
applications, Bob may be able to exploit this fact by
recording local (non-network) inputs in the log that elicit
some behavior in M_R he desires.


5 Application: Cheat detection in games 


AVMs and AVMMs are application-independent, but for 
our evaluation, we focus on one specific application, 
namely cheat detection. We begin by characterizing the 
class of cheats that AVMs can detect, and we discuss 
how AVMs compare to the anti-cheat systems that are 
in use today. 
5.1 How are cheats detected today?


Today, many online games use anti-cheating systems 
like PunkBuster [35], the Warden [23] or Valve Anti- 
Cheat (VAC) [38]. These systems work by scanning the 
user’s machine for known cheats [23, 24, 35]; some al- 
low the game admins to request screenshots or to per- 
form memory scans. In addition to privacy concerns, 
this approach has led to an arms race between cheaters 
and game maintainers, in which the former constantly 
release new cheats or variations of existing ones, and the 
latter must struggle to keep their databases up to date. 


5.2 How can AVMs be used with games? 


Recall that AVMs run entire VM images rather than in- 
dividual programs. Hence, the players first need to agree 
on a VM image that they will use. For example, one of 
them could install an operating system and the game it- 
self in a VM, create a snapshot of the VM, and then 
distribute the snapshot to the other players. Each player 
then initializes his AVM with the agreed-upon snapshot 
and plays while recording a log. If a player wishes to 
reassure himself that other players have not cheated, he 
can request their logs (during or after the game), check 
them for tampering, and replay them using his own, 
trusted copy of the agreed-upon VM image. 

Since many cheats involve installing additional pro- 
grams or modifying existing ones, it is important to dis- 
able software installation in the snapshot that is used 
during the game, e.g., by revoking the necessary privi- 
leges from all accounts that are accessible to the players. 
Otherwise, downloading and installing a cheat would 
simply be re-executed during replay without causing any 
discrepancies. However, note that this restriction is only 
required during the game; it does not prevent the main- 
tainer of the original VM image from installing upgrades 
or patches. 


5.3 How do players cheat in games?


Players can cheat in many different ways: a recent taxonomy
[41] identified no fewer than fifteen different types
of cheats, including collusion, denial of service, timing 
cheats, and social engineering. In Section 5.4, we dis- 
cuss which of these cheats AVMs are effective against, 
and we illustrate our discussion with three concrete ex- 
amples of cheats that are used in Counterstrike. Since 
the reader may not be familiar with these cheats, we de- 
scribe them here first. 

The first cheat is an aimbot. Its purpose is to help the
cheater with target acquisition. When the aimbot is ac- 
tive, the cheater only needs to point his weapon in the 
approximate direction of an opponent; the aimbot then 
automatically aims the weapon exactly at that opponent.
An aimbot is an example of a cheat that works, at least 
conceptually, by feeding the game with forged inputs. 

The second cheat is a wallhack. Its purpose is to al- 
low the cheater to see through opaque objects, such as 
walls. Wallhacks work because the game usually ren- 
ders a much larger part of the scenery than is actually 
visible on screen. Thus, if the textures on opaque ob- 
jects are made transparent or removed entirely, e.g., by 
a special graphics driver [37], the objects behind them 
become visible. A wallhack is an example of a cheat 
that violates secrecy; it reveals information that is avail- 
able to the game but is not meant to be displayed. 

The third cheat is unlimited ammunition. The vari- 
ant we used identifies the memory location in the Coun- 
terstrike process that holds the cheater’s current amount 
of ammunition, and then periodically writes a constant 
value to that location. Thus, even if the cheater con- 
stantly fires his weapon, he never runs out (similar 
cheats exist for other resources, e.g., unlimited health). 
This cheat changes the network-visible behavior of the 
cheater’s machine. It is representative of a larger class 
of cheats that rely on modifying local in-memory state; 
other examples include teleportation, which changes the 
variable that holds the player’s current position, or un- 
limited health. 


5.4 Which cheats can AVMs detect? 


AVMs are effective against two specific, broad classes 
of cheats, namely 


1. cheats that need to be installed along with the game 
in some way, e.g., as loadable modules, patches, or 
companion programs; and 

2. cheats that make the network-visible behavior of 
the cheater’s machine inconsistent with any correct 
execution. 


Both types of cheats cause replay to fail when the 
cheater’s machine is audited. In the first case, the reason 
is that replay can only succeed if the VM images used 
during recording and replay produce the same sequence 
of events recorded in the log. If different code is exe- 
cuted or different data is read at any time, replay almost 
certainly diverges soon afterward. In the second case, 
replay fails by definition because there exists no correct 
execution that is consistent with the network traffic the 
cheater’s machine has produced. 

If a cheat is in the first class but not in the second,
it may be possible to re-engineer it to avoid detection.
Common examples include cheats that violate secrecy,
such as wallhacks, and cheats that rely on forged inputs,
such as aimbots. For instance, a cheater might implement
an aimbot as a separate program that runs outside
of the AVM and aims the player’s weapon by feeding
fake inputs to the AVM’s USB port. A particularly tech-savvy
cheater might even set up a second machine that
uses a camera to capture the game state from the first
machine’s screen and a robot arm to type commands on
the first machine’s keyboard. While such cheats are by
no means impossible, they do require substantially more
effort and expertise than a simple patch or module that
manipulates the game state directly. Thus, AVMs raise
the bar significantly for such cheats.

Total number of cheats examined                        26
Cheats detectable with AVMs                            26
  ... in this specific implementation of the cheat     at most 22
  ... no matter how the cheat is implemented           at least 4
Cheats not detectable with AVMs                        0

Table 1: Detectability of Counterstrike cheats from popular
Counterstrike discussion forums


In contrast, cheats in the second class can be detected 
by AVMs in any implementation. Examples of such 
cheats include unlimited ammunition, unlimited health, 
or teleportation. For instance, if a player has k rounds
of ammunition and uses a cheat of any type to fire more
than k shots, replay inevitably fails because there is no
correct execution of the game software in which a player 
can fire after having run out of ammunition. AVMs are 
effective against any current or future cheats that fall 
into this category. 
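
The ammunition argument can be illustrated with a toy model. The event names and the function `first_impossible_event` are hypothetical and do not reflect Counterstrike's actual game logic; the sketch only shows why a network-visible shot fired with no ammunition left cannot be matched by any correct execution.

```python
def first_impossible_event(initial_rounds, events):
    """Return the index of the first logged event that no correct
    execution could have produced, or None if the log is consistent."""
    rounds = initial_rounds
    for index, (kind, amount) in enumerate(events):
        if kind == "pickup":
            rounds += amount
        elif kind == "fire":
            if rounds < amount:
                return index  # firing without ammunition: replay diverges here
            rounds -= amount
        else:
            return index  # unknown event types are treated as faults
    return None


honest = [("fire", 1), ("fire", 1), ("pickup", 2), ("fire", 1)]
cheater = [("fire", 1), ("fire", 1), ("fire", 1)]
```

Starting with two rounds, the honest log is consistent, while the cheater's third shot (index 2) has no correct explanation, regardless of how the cheat was implemented.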


We hypothesize that the first class includes almost 
all cheats that are in use today. To test this hypothe- 
sis, we downloaded and examined 26 real Counterstrike 
cheats from popular discussion forums on the Internet 
(Table 1). We found that every single one of them had to 
be installed in the game AVM to be effective, and would 
therefore be detected. We also found that at least 4 of 
the 26 cheats additionally belonged to the second class 
and could therefore be detected not only in their current 
form, but also in any future implementation. 


5.5 Summary 


Even though we did not specifically design AVMs for 
cheat detection, they do offer three important advan- 
tages over current anti-cheating solutions like VAC or 
PunkBuster. First, they protect players’ privacy by sep- 
arating auditable computation (the game in the AVM) 
from non-auditable computation (e.g., browser or bank- 
ing software running outside the AVM). Second, they 
are effective against virtually all current cheats, includ- 
ing novel, rare, or unknown cheats. Third, they are guar- 
anteed to detect all possible cheats of a certain type, no 
matter how they are implemented. 


6 Evaluation 


In this section, we describe our AVMM prototype, and 
we report how we used it to detect cheating in Coun- 
terstrike, a popular multi-player game. Our goal is to 
answer the following three questions: 


1. Does the AVMM work with state-of-the-art games? 
2. Are AVMs effective against real cheats? 
3. Is the overhead low enough to be practical? 


6.1 Prototype implementation 


Our prototype AVMM implementation is based on 
VMware Workstation 6.5.1, a state-of-the-art virtual 
machine monitor whose source code we obtained 
through VMware’s Academic Program. VMware Work- 
station supports a wide range of guest operating sys- 
tems, including Linux and Microsoft Windows, and its 
VMM already supports many features that are useful 
for AVMs, such as deterministic replay and incremen- 
tal snapshots. We extended the VMM to record ex- 
tra information about incoming and outgoing network 
packets, and we added support for tamper-evident log- 
ging, for which we adapted code from PeerReview [21]. 
Since VMware Workstation only supports uniprocessor 
replay, our prototype is limited to AVMs with a single 
virtual core (see Section 7.4 for a discussion of multi- 
processor replay). However, most of the logging func- 
tionality is implemented in a separate daemon process 
that communicates with the VMM through kernel-level 
pipes, so the AVMM can take advantage of multi-core 
CPUs by using one of the cores for logging, crypto- 
graphic operations and auditing, while running AVMs 
on the other cores at full speed. 

Our audit tool implements a two-step process: Play- 
ers first perform the syntactic check using a separate tool 
and then run the semantic check by replaying the log in a
local AVM, using a copy of the VM image they trust. If 
at least one of the two stages fails, they can give the log 
and the authenticators as evidence to fellow players— 
or, indeed, any third party. All steps are deterministic, 
so the other party will obtain the same result. 


6.2 Experimental setup 


For our evaluation, we used the AVMM prototype to de- 
tect cheating in Counterstrike. There are two reasons for 
this choice. First, Counterstrike is played in a variety of 
online leagues, as well as in worldwide championships 
such as the World Cyber Games, which makes cheat- 
ing a matter of serious concern. Second, there is a large 
and diverse ecosystem of readily available Counterstrike 
cheats, which we can use for our experiments. 

Figure 3: Growth of the AVMM log, and an equivalent
VMware log, while playing Counterstrike.

Our experiments are designed to model a Counterstrike
game as it would be played at a competition or
at a LAN party. We used three Dell Precision T1500
workstations, one for each player, with 8 GB of mem- 
ory and 2.8 GHz Intel Core i7 860 CPUs. Each CPU 
has four cores and two hyperthreads per core. The machines
were connected to the same switch via 1 Gbps
Ethernet links, and they were running Linux 2.6.32 (De- 
bian 5.0.4) as the host operating system. On each ma- 
chine, we installed an AVMM binary that was based on 
a VMware Workstation 6.5.1 release build. Each player 
had access to an ‘official’ VM snapshot, which contained
Windows XP SP3 as the guest operating system,
as well as Counterstrike 1.6 at patch version 1.1.2.5. 
Sound and voice were disabled in the game and in 
VMware. As discussed in Section 5.2, we configured 
the snapshot to disallow software installation. In the 
snapshot, the OS was already booted, and the player was 
logged in without administrator privileges. 

All players were using 768-bit RSA keys. These keys 
are not strong enough to provide long-term security, but 
in our scenario the signatures only need to last until any 
cheaters have been identified, i.e., at most a few days or 
weeks beyond the end of the game. In December 2009, 
factoring a 768-bit number took almost 2,000 Opteron- 
CPU years [3], so this key length should be safe for gam- 
ing purposes for some time to come. 

To quantify the costs of various aspects of AVMs, we 
ran experiments in five different configurations. bare- 
hw is our baseline configuration in which the game 
runs directly on the hardware, without virtualization. 
vmware-norec adds the virtual machine monitor with- 
out our modifications, and vmware-rec adds the logging 
for deterministic replay. avmm-nosig uses our AVMM
implementation without signatures, and avmm-rsa768 
is the full system as described. 

Figure 4: Average log growth for Counterstrike by content.
The bars in front show the size after compression.

We removed the default frame rate cap of 72 fps,
so that Counterstrike rendered frames as quickly as
the available CPU resources allowed and we could use the
achieved frame rate as a performance metric. In Section 6.5
we consider a configuration with the default
frame rate cap. To make sure the performance of bare-hw
and virtualized configurations can be compared, we
configured the game to run without OpenGL, which is
not supported in our version of VMware Workstation,
and we ran the game in windowed rather than full-screen
mode. We played each game for at least thirty minutes.


6.3 Functionality check 


Recall from Section 5.4 that AVMs can detect by design 
all of the 26 cheats we examined. As a sanity check to 
validate our implementation, we tried four Counterstrike 
cheats in our collection that do not depend on OpenGL. 
For each cheat, we created a modified VM image that 
had the cheat preinstalled, and we ran an experiment in 
the avmm-rsa768 configuration where one of the play- 
ers used the special VM image and activated the cheat. 
We then audited each player; as expected, the audits of 
the honest players all succeeded, while the audits of the 
cheater failed due to a divergence during replay. 


6.4 Log size and contents 


The AVMM records a log of the AVM’s execution dur- 
ing game play. To determine how fast this log grows, 
we played the game in the avmm-rsa768 configuration, 
and we measured the log size over time. Figure 3 shows 
the results. The log grows slowly while players are join- 
ing the game (until about 3 minutes into the experiment) 
and then continues to grow steadily during game play, 
by about 8 MB per minute. For comparison, we also 
show the size of an equivalent VMware log; the differ- 
ence is due to the extra information that is required to 
make the log tamper-evident. 

Figure 4 shows the average log growth rate by
content. More than 70% of the AVMM log consists of
information needed for replay; tamper-evident logging
is responsible for the rest. The replay information consists
mainly of TimeTracker entries (59%), which
are used by the VMM to record the exact timing of
events, and MAC-layer entries (14%), such as incoming
or outgoing network packets; other entry types account
for the remaining 27%. The composition of the
VMware log differs slightly because the packet payloads 
are stored in the MAC-layer entries rather than in the 
tamper-evident logging entries. We also show results af- 
ter applying bzip2 and a lossless, VMM-specific (but 
application-independent) compression algorithm we de- 
veloped. This brings the average log growth rate to 
2.47 MB per minute. 

From these results, we can estimate that a one-hour 
game session would result in a 480 MB log, or 148 MB 
after compression. Thus, given that current hard disk 
capacities are measured in terabytes, storage should not 
be a problem, even for very long games. Also, when a 
player is audited, he must upload his log to his fellow 
players. If the game is played over the Internet, upload- 
ing a one-hour log would take about 21 minutes over 
a 1 Mbps upstream link. If the game is played over a 
LAN, e.g., at a competition, the upload would complete 
in a few seconds. To avoid detection delays, our pro- 
totype can also perform auditing concurrently with the 
game; we evaluate this feature in Section 6.11. 
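
The storage and upload estimates above follow from simple arithmetic, which the following sketch reproduces. The growth rates are the ones reported in this section; the unit conventions (1 MB = 2^20 bytes, 1 Mbps = 10^6 bits per second) are assumptions chosen for the example.

```python
RAW_MB_PER_MIN = 8.0          # uncompressed AVMM log growth during game play
COMPRESSED_MB_PER_MIN = 2.47  # after bzip2 and VMM-specific compression


def log_size_mb(minutes, mb_per_min):
    """Estimated log size after the given number of minutes of play."""
    return minutes * mb_per_min


def upload_minutes(size_mb, link_mbps, bytes_per_mb=2**20):
    """Time to upload a log of size_mb over a link of link_mbps."""
    bits = size_mb * bytes_per_mb * 8
    seconds = bits / (link_mbps * 1_000_000)
    return seconds / 60


raw_hour = log_size_mb(60, RAW_MB_PER_MIN)                # 480 MB
compressed_hour = log_size_mb(60, COMPRESSED_MB_PER_MIN)  # about 148 MB
upload = upload_minutes(compressed_hour, 1.0)             # about 21 minutes
```

Under these unit assumptions the compressed one-hour log takes roughly 20.7 minutes to upload at 1 Mbps, in line with the estimate of about 21 minutes.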


6.5 Low growth with the frame rate cap 


Recall that Counterstrike was configured without a 
frame rate cap in our experiments, so that the mea- 
sured frame rate can be used as a performance met- 
ric. We discovered that when the frame rate cap is en- 
abled, Counterstrike appears to implement inter-frame 
delays by busy-waiting in a tight loop, reading the sys- 
tem clock. Since the AVMM has to log every clock ac- 
cess as a nondeterministic input, this increases the log 
growth considerably — by a factor of 18 when the default 
cap of 72 fps is used. 

To reduce the log growth for applications that exhibit
this behavior, we experimented with the following optimization.
Whenever the AVMM observes consecutive
clock reads from the same AVM within 5 µs of each
other, it delays the n-th consecutive read by $2^{n-2} \cdot 50$ µs,
starting with the second read and up to a limit of 5 ms.
The exponential progression of delays limits the number
of clock reads during long waits, but does not unduly affect
timing accuracy during short waits.
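
The delay schedule can be sketched as follows, assuming the delay for the n-th consecutive read is 2^(n-2) times 50 µs, capped at 5 ms. The function names are hypothetical, and `reads_during_wait` idealizes a busy-wait loop by assuming that the clock reads themselves take no time.

```python
def clock_read_delay_us(n, base_us=50, cap_us=5000):
    """Delay in microseconds applied to the n-th consecutive clock read.

    n counts from 1; the first read is never delayed.
    """
    if n < 2:
        return 0
    return min(base_us * 2 ** (n - 2), cap_us)


def reads_during_wait(wait_us):
    """Number of clock reads a busy-wait loop performs before it
    observes that wait_us microseconds have passed."""
    reads = 1      # the initial, undelayed read
    elapsed = 0
    n = 2
    while elapsed < wait_us:
        elapsed += clock_read_delay_us(n)
        reads += 1
        n += 1
    return reads
```

For an inter-frame wait of about 13.9 ms (one frame at 72 fps), this model performs only 10 clock reads, each of which would be logged, instead of one per loop iteration.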

This optimization is very effective: log growth is ac- 
tually 2% lower than reported in Section 6.4, with or 
without the frame-rate cap. Moreover, the uncapped 
frame rate is only 3% lower than the rate without the op- 
timization, which shows that the optimization has only 
a mild impact on game performance. 


6.6 Syntactic and semantic check 


[Figure 5: Median ping round-trip times, with error bars, for the configurations bare hardware, VMware, VMware (recording), AVMM (nosig), and AVMM (RSA-768); legend: Baseline, With VMM, With AVMM.]

Alice can audit another player Bob by checking Bob’s
log against his authenticators (syntactic check) and by
replaying Bob’s log using a trusted copy of the VM image
(semantic check). We expect the syntactic check
to be relatively fast, since it is essentially a matter of 
verifying signatures, whereas the replay involves repeat- 
ing all the computations that were performed during the 
original game and should therefore take about as long 
as the game itself. Our experiments with the log of 
the server machine from the avmm-rsa768 configuration 
(which covers 2,216 seconds with 1,987 seconds of ac- 
tual game play) confirm this. We needed 34.7 seconds 
to compress the log, 13.2 seconds to decompress it, 6.9 
seconds for the syntactic check, and 1,977 seconds for 
the semantic check (2,031 seconds total). Replay was 
actually a bit faster because the AVMM skips any time 
periods in the recording during which the CPU was idle, 
e.g., before the game was started. 
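
The audit cost reported above can be broken down with a short calculation. The timings are the ones given in this paragraph; the dictionary layout and variable names are only for illustration.

```python
# Measured audit steps for the server log, in seconds.
audit_seconds = {
    "compress": 34.7,
    "decompress": 13.2,
    "syntactic check": 6.9,
    "semantic check": 1977.0,
}
recorded_seconds = 2216  # duration covered by the log

total = sum(audit_seconds.values())
semantic_share = audit_seconds["semantic check"] / total
audit_vs_recording = total / recorded_seconds
```

The semantic check (replay) accounts for more than 97% of the audit time, and the complete audit takes roughly 92% of the recorded duration.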

Unlike the performance of the actual game, the speed 
of auditing is not critical because it can be performed 
at leisure, e.g., in the background while the machine is 
used for something else. 


6.7 Network traffic 


The AVMM increases the amount of network traffic for 
two reasons: First, it adds a cryptographic signature 
to each packet, and second, it encapsulates all packets 
in a TCP connection. To quantify this overhead, we 
measured the raw, IP-level network traffic in the bare- 
hw configuration and in the avmm-rsa768 configuration. 
On average, the machine hosting the game sent 22 kbps 
in bare-hw and 215.5 kbps in avmm-rsa768. 

This high relative increase is partly due to the fact 
that Counterstrike clients send extremely small packets 
of 50-60 bytes each, at 26 packets/sec, so the AVMM’s 
fixed per-packet overhead (which includes one crypto- 
graphic signature for each packet and one for each ac- 
knowledgment) has a much higher impact than it would 
for packets of average Internet packet size. However, 
in absolute terms, the traffic is still quite low and well 
within the capabilities of even a slow broadband up- 
stream. 
Figure 6: Average CPU utilization in Counterstrike for
each of the eight hyperthreads, and for the entire CPU. 


6.8 Latency 


The AVMM adds some latency to packet transmissions 
because of the logging and processing of authenticators. 
To quantify this, we ran an AVM in five different con- 
figurations and measured the round-trip time (RTT) of 
100 ICMP Echo Request packets. Figure 5 shows the
median RTT, as well as the 5th and the 95th percentile.
Since our machines are connected to the same switch,
the RTT on bare hardware is only 192 µs; adding virtualization
increases it to 525 µs, VMware recording to
621 µs, and the daemon to above 2 ms. Enabling 768-bit
RSA signatures brings the total RTT to about 5 ms. Recall
that both the ping and the pong are acknowledged,
so four signatures need to be generated and verified.
Since the critical threshold for interactive applications is
well above 100 ms [12], 5 ms seems tolerable for games.
The overhead could be reduced by using a signing algorithm
such as ESIGN [34], which can generate and
verify a 2046-bit signature in less than 125 µs.


6.9 CPU utilization 


Compared to a Counterstrike game on bare hardware, 
the AVMM requires additional CPU power for virtual- 
ization and for the tamper-evident log. To quantify this 
overhead, we measured the CPU utilization in five con- 
figurations, ranging from bare-hw to avmm-rsa768. To 
isolate the contribution from the tamper-evident log, we 
pinned the daemon process to hyperthread 0 (HT 0) in 
the AVMM experiments and restricted the game to the 
other hyperthreads except for HT 0’s hypertwin, HT 4, 
which shares a core with HT 0.² One of the machines
in our experiments runs the Counterstrike server in ad- 
dition to serving a player. To be conservative, we report 
numbers for this machine, as it has the highest load. 

² Nevertheless, the load on HT 4 is not exactly zero because Linux
performs kernel-level IRQ handling on lightly-loaded hyperthreads.

9th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’10)

[Figure 7 plot: average frame rate (y-axis, 0 to 200) for the configurations bare hardware, VMware, VMware (recording), AVMM (nosig), and AVMM (RSA-768); legend: Baseline, With VMM, With AVMM.]
Figure 7: Frame rate in Counterstrike for each of the
three machines. The left machine was hosting the game.

Figure 6 shows the average utilization for each HT,
as well as the average across the entire CPU. The utilization
of HT 0 (below 8%) in the AVM experiments
shows that the overhead from the tamper-evident log is
low. The game is constantly busy rendering frames, but 
because the Counterstrike rendering engine is single- 
threaded, it cannot run on more than one HT at a time. 
The OS/VMM will sometimes schedule it on one HT 
and sometimes on another, thus we expect an average 
utilization over the eight HTs of 12.5%, which our re- 
sults confirm. 


6.10 Frame rate 


Since the game is rendering frames as fast as the avail- 
able CPU cycles allow, a meaningful metric for the CPU 
overhead is the achieved frame rate, which we consider 
next. To measure the frame rate, we wrote an AMX 
Mod X [1] script that increments a counter every time 
a frame is rendered. We read out this counter at the be- 
ginning and at the end of each game, and we divided 
the difference by the elapsed time. Figure 7 shows our 
results for each of the three machines. The results vary 
over time and among players, because the frame rate de- 
pends on the complexity of the scene being rendered, 
and thus on the path taken by each player. 

The frame rate on the AVMM is about 13% lower than 
on bare hardware. The biggest overhead seems to come 
from enabling recording in VMware Workstation, which 
causes the average frame rate to drop by about 11%. In 
absolute terms, the resulting frame rate (137 fps) is still 
very high; posts in Counterstrike forums generally rec- 
ommend configuring the game for about 60-80 fps. 

To quantify the advantage of running some of the 
AVMM logic on a different HT, we ran an additional ex- 
periment with both Counterstrike and all AVMM threads 
pinned to the same hyperthread. This reduced the aver- 
age frame rate by another 11 fps. 


6.11 Online auditing 


Figure 8: Frame rate for each of the three machines with
zero, one, or two online audits per machine.

If a game session is long or the stakes are particularly
high, players may wish to detect cheaters well before
the end of the game. In such cases, players can incrementally
audit other players’ logs while the game is still
in progress. In this configuration, which we refer to as 
online auditing, cheating could be detected as soon as 
the externally visible behavior of the cheater’s machine 
deviates from that of the reference machine. 

If a player uses the same machine to concurrently play
the game and audit other players, the higher resource 
consumption can affect game performance. To quantify 
this effect, we played the game in the avmm-rsa768 con- 
figuration with each player auditing zero, one, or two 
other players on the same machine. As before, we mea- 
sured the average frame rate experienced by each player. 

Figure 8 shows our results. With an increasing num- 
ber of players audited, the frame rate drops somewhat, 
from 137 fps with no audits to 104 fps with two audits. 
However, the drop is less pronounced than expected be- 
cause the audits can leverage the unused cores. If the 
number of audits a is increased further, we expect the 
game performance to eventually degrade with 1/a. 

Since replay is slightly slower than the original exe- 
cution, auditing falls behind the game by about four sec- 
onds per minute of play, even when the audit executes on 
an otherwise unloaded machine. To ensure quick detection
even during very long game sessions, we can compensate
by artificially slowing down the original execution.
We found that a 5% slowdown was sufficient to
allow the auditor to keep up; this reduced the frame rate 
by up to 7 fps. Note that a certain lag can actually be 
useful to prevent players from learning each other’s cur- 
rent positions or strategies through an audit. In practice, 
players may want to disallow audits of the current round 
and/or the most recent moments of game play. 


6.12 Spot checking 


Online games are not a very interesting use case for spot 
checking because complete audits are feasible. There- 
fore, we set up a simple additional experiment that mod- 
els a client/server system — specifically, a MySQL 5.0.51 
server in one AVM and a client running MySQL’s 
sql-bench benchmark in another. We ran this ex- 
periment for 75 minutes in the avmm-rsa768 configura- 
tion, and we recorded a snapshot every five minutes. We 
found that, on average, our prototype takes 5 seconds to 
record a snapshot. The incremental disk snapshots are 
between 1.9 MB and 91 MB, while each memory snap- 
shot occupies about 530 MB. The reason for the latter 
is that VMware Workstation creates a full dump of the 
AVM’s main memory (512 MB) for each snapshot. This 
could probably be optimized considerably, e.g., using 
techniques from Remus [11]. 


In the following, we refer to the part of the log between
two consecutive snapshots as a segment, and to
k consecutive segments as a k-chunk. To quantify the
costs of spot checking, we audited all possible k-chunks
in our log for k ∈ {1,3,5,9,12}, and measured the
amount of data that must be transferred over the network,
as well as the time it takes to replay the chunk.
However, we excluded k-chunks that start at the beginning
of the log; these are atypical because a) they are
the only chunks for which no memory or disk snapshots
have to be transferred, and b) they have less activity
because the MySQL server is not yet running at the
beginning. We report averages because the results for chunks
with the same value of k never varied by more than 10%.

Figure 9 shows the results, normalized to the cost of
a full audit. As expected, the cost grows with the chunk 
size k; however, there is an additional fixed cost per 
chunk for transferring the corresponding memory and 
disk snapshots. 
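
The chunk enumeration and the fixed-plus-linear cost structure described above can be sketched as follows. This is an illustrative model only. The segment count of 15 assumes that a 75-minute run with a snapshot every five minutes yields 15 segments, and the per-segment and per-chunk cost values are hypothetical placeholders, not measurements from the experiment.

```python
def k_chunks(num_segments, k, skip_log_start=True):
    """Return (first, last) segment indices of every run of k consecutive segments.

    Chunks that begin at segment 0 (the start of the log) are skipped by default,
    mirroring the exclusion described in the text.
    """
    first_start = 1 if skip_log_start else 0
    return [(s, s + k - 1) for s in range(first_start, num_segments - k + 1)]


def relative_cost(k, num_segments, per_segment, per_chunk_fixed):
    """Cost of auditing one k-chunk, normalized to a full audit of the log.

    Assumes a full audit starts at the beginning of the log and therefore
    pays no snapshot-transfer cost.
    """
    full_audit = num_segments * per_segment
    return (per_chunk_fixed + k * per_segment) / full_audit


NUM_SEGMENTS = 15  # assumption: 75 minutes / 5 minutes per snapshot

for k in (1, 3, 5, 9, 12):
    chunks = k_chunks(NUM_SEGMENTS, k)
    # per_segment and per_chunk_fixed are made-up units for illustration
    cost = relative_cost(k, NUM_SEGMENTS, per_segment=1.0, per_chunk_fixed=0.5)
    print(f"k={k:2d}: {len(chunks):2d} chunks, relative cost {cost:.2f}")
```

In this model the fixed term dominates for small k, which is one way to read the additional per-chunk cost visible in the results.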


6.13 Summary


Having reported our results, we now revisit our three ini- 
tial questions. We have demonstrated that our AVMM 
works out-of-the-box with Counterstrike, a state-of-the- 
art game, and we have shown that it is effective against 
real cheats we downloaded from Counterstrike forums 
on the Internet. AVMs are not free; they affect various 
metrics such as latency, traffic, or CPU utilization, and 
they reduce the frame rate by about 13%, compared to 
the rate achieved on bare hardware. In return for this 
overhead, players gain the ability to audit other players. 
Auditing takes time, in some cases as much as the game 
itself, but it seems time well spent because it either ex- 
poses a cheater or clears an innocent player of suspicion. 
AVMs provide this novel capability by combining two 
seemingly unrelated technologies, tamper-evident logs 
and virtualization. 


7 Discussion 


7.1 Other applications 


AVMs are application-independent and can be used in 
applications other than games. 

Distributed systems: AVMs can be used to make any 
distributed system accountable, simply by executing the 
software on each node within an AVM. The node soft- 
ware can be arbitrarily complex and available only as a 
binary system image. Accountability is useful in dis- 
tributed systems where principals have an interest in 
monitoring the behavior of other principals’ nodes, and 
where post factum detection is sufficient. Such systems 
include federated systems where no single entity has 
complete control or visibility of the entire system, where 
different parties compete (e.g., in an online game, an 
auction, or a federated system like the Internet) or where 
parties are expected to cooperate but lack adequate in- 
centives to do so (e.g., in a peer-to-peer system). 

Network traffic accountability: AVMs could also
be useful in detecting advanced forms of malware 
that could escape online detection mechanisms. An 
AVM, combined with a traffic monitor that records 
a machine’s network communication, can capture the 
network-observable behavior of a machine, and replay it 
later with expensive intrusion detection (e.g., taint track- 
ing) in place. 

Cloud computing: Another potential application of 
AVMs is cloud computing. AVMs can enable cloud cus- 
tomers to verify that their software executes in the cloud 
as expected. AVMs are a perfect match for
infrastructure-as-a-service (IaaS) clouds that offer customers a
virtual machine. However, AVMs in the cloud face additional
challenges: auditors cannot easily replay the entire
execution for lack of resources; accountable services
must be able to interact with non-accountable clients;
and it may not be practical to sign every single packet.
The first challenge can be addressed with spot checking 
(Section 3.5). We plan to address the remaining chal- 
lenges in future work. 


7.2 Using trust to get stronger guarantees 


One of the strengths of AVMs is that they can verify the 
integrity of a remote node’s execution without relying 
on trusted components. However, if trusted components 
are available, we can obtain additional guarantees. We 
sketch two possible extensions below. 

Secure local input: AVMs cannot detect the hypothet- 
ical re-engineered aimbot from Section 5.4 because ex- 
isting hardware does not authenticate events from local 
input devices, such as keyboards or mice. Thus, a com- 
promised AVMM can forge or suppress local inputs, and 
even a correct AVMM cannot know whether a given 
keystroke was generated by the user or synthesized by 
another program, or another machine. This limitation 
can be overcome by adding crypto support to the input 
devices. For example, keyboards could sign keystroke 
events before reporting them to the OS, and an auditor 
could verify that the keystrokes are genuine using the 
keyboard’s public key. Since most peripherals gener- 
ate input at relatively low rates, the necessary hardware 
should not be expensive to build. 

Trusted AVMM: If we can trust the AVMM that is run- 
ning on a remote node, we can detect additional classes 
of cheats and attacks, including certain attacks on con- 
fidentiality. For example, a trusted AVMM could estab- 
lish a secure channel between the AVM and Alice (even 
if the software in the AVM does not support encryption) 
and thus prevent Bob’s machine from leaking informa- 
tion by secretly communicating with other machines. A 
trusted AVMM could also prevent wallhacks (see Sec- 
tion 5.3) by controlling outside access to the machine’s 
graphics card. If trusted hardware, such as memory
encryption [40], is available on Bob’s machine, the AVMM
could even prevent Bob from reading information
directly from memory. Remote attestation could be used
to make sure that a trusted AVMM is indeed running on
a remote computer, e.g., using a system like Terra [17].


7.3 Accountability versus privacy


Ideally, an accountability system should disclose to an 
auditor only the information strictly required to deter- 
mine that the auditee has met his obligations. By this 
standard, AVM logs are rather verbose: an AVM records 
enough information to replay the execution of the software
it is running. This is a price we pay for the generality
of AVMs: they can detect a large class of faults
in complex software available only in binary form. In
practice, the amount of extra information released can
be controlled.

Let us consider how the extra information captured
in the AVM logs affects Alice and Bob’s privacy. The
log reveals information about actions of Bob’s machine, 
but only about the execution inside a given AVM, and 
only to approved auditors. In the web service scenario 
(Figure 2b), Alice is presumably paying Bob for run- 
ning her software in an AVM, so she has every right to 
know about the execution of the software. Similarly, it 
is not unreasonable to expect players in a game to share 
information about their game execution. In either case, 
the auditor cannot observe executions the auditee may 
be running outside the audited AVM. 


Alice and Bob’s privacy may be affected when she 
uses part of the log as evidence to demonstrate a fault on 
Bob’s machine to a third party. The evidence reveals ad- 
ditional information about the AVM, including a snap- 
shot, to that party. Therefore, Alice should release evi- 
dence only to third parties that have a legitimate need to 
know about faults on Bob’s machine. To limit the extra 
information released to third parties, Alice can use the 
hash tree (Section 4.4) to remove any part of the snap- 
shot that is not necessary to replay the relevant segment. 
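
The idea of withholding parts of a snapshot while keeping the rest verifiable can be illustrated with a small hash-tree sketch. It is a simplified stand-in rather than the construction from Section 4.4: the block names, the use of SHA-256, and the choice to replace each withheld block by its leaf hash are assumptions made for this example.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaf_hashes):
    """Fold a non-empty list of leaf hashes pairwise into a single root hash."""
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2:               # duplicate the last hash on odd-sized levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def redact(blocks, needed):
    """Keep blocks whose index is in `needed`; replace all others by their hashes."""
    return [("data", b) if i in needed else ("hash", h(b))
            for i, b in enumerate(blocks)]


def verify(evidence, expected_root):
    """Recompute the root from a mix of revealed blocks and bare hashes."""
    leaves = [h(v) if kind == "data" else v for kind, v in evidence]
    return merkle_root(leaves) == expected_root


# Hypothetical snapshot of four blocks; only blocks 0 and 3 are needed for replay.
snapshot = [b"page-0", b"page-1", b"secret-page-2", b"page-3"]
root = merkle_root([h(b) for b in snapshot])   # assumed to be authenticated in the log
evidence = redact(snapshot, needed={0, 3})
print(verify(evidence, root))
```

The third party can check the evidence against the authenticated root without ever seeing the contents of the withheld blocks.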


7.4 Replay for multiprocessors 


Our prototype AVMM can assign only a single CPU
core to each AVM, because VMware’s deterministic
replay is limited to uniprocessors. SMP-ReVirt [16] has
recently demonstrated that deterministic replay is also 
possible for multiprocessors, but its cost is substantially 
higher than the cost of uniprocessor replay. Because 
replay is a building block for many important applica- 
tions, such as forensics [15], replication [11], and de- 
bugging [25], there is considerable interest in develop- 
ing more efficient techniques [5, 13, 16, 28, 29]. As 
more efficient techniques become available, AVMMs 
can directly benefit from them. 


7.5 Bug detection 


Recall that AVMs define faults as deviations from the 
behavior of a reference implementation. If the reference 
implementation has a bug and this bug is triggered dur- 
ing an execution, it will behave identically during the 
replay, and thus it will not be classified as a fault. If 
a bug in the reference implementation permits unautho- 
rized software modification (e.g., a buffer overflow bug), 
then neither the modification itself nor the behavior of 
the modified software will be reported as a fault. 
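
The following toy sketch illustrates this fault definition. The reference implementation, its bug, and the cheating variant are all hypothetical; the point is only that an audit by replay flags deviations from the reference, and not bugs that the reference itself exhibits.

```python
def replay(reference, log):
    """Re-run `reference` on the logged inputs.

    Returns the index of the first entry whose recorded output differs from
    what the reference produces, or None if the log is consistent.
    """
    for i, (inp, recorded_out) in enumerate(log):
        if reference(inp) != recorded_out:
            return i
    return None


def reference(x):
    # Hypothetical reference implementation with a bug:
    # inputs >= 100 are mishandled and yield 0.
    return x * 2 if x < 100 else 0


def cheating(x):
    # Hypothetical modified software: deviates from the reference on input 7.
    return reference(x) + 1 if x == 7 else reference(x)


inputs = (1, 7, 150)                      # 150 triggers the reference bug
honest_log = [(x, reference(x)) for x in inputs]
cheat_log = [(x, cheating(x)) for x in inputs]

print(replay(reference, honest_log))      # the bug replays identically
print(replay(reference, cheat_log))       # the deviation is detected
```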


Detecting bugs in the reference implementation is
outside the fault model that AVMs were designed for.
However, deterministic execution replay provides an op- 
portunity to use sophisticated runtime analysis tools dur- 
ing auditing [10]. In particular, techniques whose run- 
time costs are too high for deployment in a live system 
could be used during an off-line replay. Taint track- 
ing, for instance, can reliably detect the unsafe use of 
data that were received from an untrusted source [33], 
thus detecting buffer overwrite attacks and other forms 
of unauthorized software installation. More generally, 
sophisticated runtime techniques can be used during re- 
play to detect bugs, vulnerabilities and attacks as part of 
a normal audit. 


8 Conclusion 


Accountable virtual machines (AVMs) allow users to
audit software executing on remote machines. An AVM
can detect a large and general class of faults, and it pro- 
duces evidence that can be verified independently by a 
third party. At the same time, an AVM allows the op- 
erator of the remote machine to prove whether his ma- 
chine is correct. To demonstrate that AVMs are feasi- 
ble, we have designed and implemented an AVM mon- 
itor based on VMware Workstation and used it to de- 
tect real cheats in Counterstrike, a popular online multi- 
player game. Players can record their game execution in 
a tamper-evident manner at a modest cost in frame rate. 
Other players can audit the execution to detect cheats, 
either after the game has finished or concurrently with 
the game. The system is able to detect all of the 26
existing cheats we examined.


Acknowledgments 


We appreciate the detailed and helpful feedback from 
Jon Howell, the anonymous OSDI reviewers, and our 
shepherd, Mendel Rosenblum. We would like to thank 
VMware for making the source code of VMware Work- 
station available to us under the VMware Academic Pro- 
gram, and our technical contact, Jim Chow, who has 
been extremely helpful. Finally, we would like to thank 
our many enthusiastic Counterstrike volunteers. 


References 


[1] AMX Mod X project. http://www.amxmodx.org/.

[2] D. Andersen, H. Balakrishnan, N. Feamster, T. Koponen, 
D. Moon, and S. Shenker. Accountable Internet protocol 
(AIP). In Proceedings of the ACM SIGCOMM Conference (SIG- 
COMM), Aug. 2008. 

[3] K. Aoki, J. Franke, A. K. Lenstra, E. Thomé, J. W. Bos, 
P. Gaudry, A. Kruppa, P. L. Montgomery, D. A. Osvik, H. te 
Riele, A. Timofeev, and P. Zimmerman. Factorization of a
768-bit RSA modulus.
http://eprint.iacr.org/2010/006.pdf.

[4] K. Argyraki, P. Maniatis, O. Irzak, and S. Shenker. An
accountability interface for the Internet. In Proceedings of the IEEE
International Conference on Network Protocols (ICNP), Oct. 2007.

[5] A. Aviram, S.-C. Weng, S. Hu, and B. Ford. Efficient
system-enforced deterministic parallelism. In Proceedings of the
USENIX Symposium on Operating System Design and Implementation
(OSDI), Oct. 2010.

[6] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho,
R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of
virtualization. In Proceedings of the ACM Symposium on Operating
Systems Principles (SOSP), Oct. 2003.

[7] N. E. Baughman, M. Liberatore, and B. N. Levine. Cheat-proof
playout for centralized and peer-to-peer gaming. IEEE/ACM
Transactions on Networking (ToN), 15(1):1-13, Feb. 2007.

[8] T. C. Bressoud and F. B. Schneider. Hypervisor-based fault
tolerance. ACM Transactions on Computer Systems (TOCS),
14(1):80-107, 1996.

[9] C. Chambers, W. Feng, W. Feng, and D. Saha. Mitigating
information exposure to cheaters in real-time strategy games. In
Proceedings of the ACM International Workshop on Network and
operating systems support for digital audio and video (NOSSDAV),
June 2005.

[10] J. Chow, T. Garfinkel, and P. M. Chen. Decoupling dynamic
program analysis from execution in virtual environments. In
Proceedings of the USENIX Annual Technical Conference, June 2008.

[11] B. Cully, G. Lefebvre, D. Meyer, M. Feeley, N. Hutchinson, and
A. Warfield. Remus: High availability via asynchronous virtual
machine replication. In Proceedings of the USENIX Symposium
on Networked Systems Design and Implementation (NSDI), Apr. 2008.

[12] J. Dabrowski and E. V. Munson. Is 100 milliseconds too fast? In
Proceedings of the ACM SIGCHI Conference on Human Factors
in Computing Systems (CHI), Apr. 2001.

[13] J. Devietti, B. Lucia, L. Ceze, and M. Oskin. DMP: Deterministic
shared memory multiprocessing. In Proceedings of the ACM
International Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS), Mar. 2009.

[14] R. Dingledine, M. J. Freedman, and D. Molnar. Peer-to-Peer:
Harnessing the Power of Disruptive Technologies, chapter
Accountability. O’Reilly and Associates, 2001.

[15] G. W. Dunlap, S. T. King, S. Cinar, M. Basrai, and P. M. Chen.
ReVirt: Enabling intrusion analysis through virtual-machine
logging and replay. In Proceedings of the USENIX Symposium
on Operating System Design and Implementation (OSDI), Dec. 2002.

[16] G. W. Dunlap, D. Lucchetti, P. M. Chen, and M. Fetterman.
Execution replay for multiprocessor virtual machines. In Proceedings
of the ACM/USENIX International Conference on Virtual
Execution Environments (VEE), Mar. 2008.

[17] T. Garfinkel, B. Pfaff, J. Chow, M. Rosenblum, and D. Boneh.
Terra: A virtual machine-based platform for trusted computing.
In Proceedings of the ACM Symposium on Operating Systems
Principles (SOSP), Oct. 2003.

[18] A. Haeberlen. A case for the accountable cloud. In Proceedings
of the ACM SIGOPS International Workshop on Large-Scale
Distributed Systems and Middleware (LADIS), Oct. 2009.

[19] A. Haeberlen, P. Aditya, R. Rodrigues, and P. Druschel.
Accountable virtual machines. Technical Report 2010-3, Max
Planck Institute for Software Systems, Sept. 2010.

[20] A. Haeberlen, I. Avramopoulos, J. Rexford, and P. Druschel.
NetReview: Detecting when interdomain routing goes wrong. In
Proceedings of the USENIX Symposium on Networked Systems
Design and Implementation (NSDI), Apr. 2009.

[21] A. Haeberlen, P. Kuznetsov, and P. Druschel. PeerReview:
Practical accountability for distributed systems. In Proceedings of
the ACM Symposium on Operating Systems Principles (SOSP),
Oct. 2007.

[22] A. Haeberlen, P. Kuznetsov, and P. Druschel. PeerReview:
Practical accountability for distributed systems. Technical Report
2007-3, Max Planck Institute for Software Systems, Oct. 2007.

[23] G. Hoglund. 4.5 million copies of EULA-compliant spyware.
http://www.rootkit.com/blog.php?newsid=358.

[24] G. Hoglund and G. McGraw. Exploiting Online Games: Cheating
Massively Distributed Systems. Addison-Wesley, 2007.

[25] S. T. King, G. W. Dunlap, and P. M. Chen. Debugging operating
systems with time-traveling virtual machines. In Proceedings of
the USENIX Annual Technical Conference, Apr. 2005.

[26] B. W. Lampson. Computer security in the real world. In
Proceedings of the Annual Computer Security Applications Conference
(ACSAC), Dec. 2000.

[27] P. Laskowski and J. Chuang. Network monitors and contracting
systems: competition and innovation. In Proceedings of the
ACM SIGCOMM Conference (SIGCOMM), Sept. 2006.

[28] D. Lee, M. Said, S. Narayanasamy, Z. Yang, and C. Pereira.
Offline symbolic analysis for multi-processor execution replay.
In Proceedings of the IEEE/ACM International Symposium on
Microarchitecture (MICRO), Dec. 2009.

[29] D. Lee, B. Wester, K. Veeraraghavan, S. Narayanasamy, P. M.
Chen, and J. Flinn. Respec: Efficient online multiprocessor
replay via speculation and external determinism. In Proceedings
of the ACM International Conference on Architectural Support
for Programming Languages and Operating Systems (ASPLOS),
Mar. 2010.

[30] D. Levin, J. R. Douceur, J. R. Lorch, and T. Moscibroda. TrInc:
Small trusted hardware for large distributed systems. In
Proceedings of the USENIX Symposium on Networked Systems
Design and Implementation (NSDI), Apr. 2009.

[31] N. Michalakis, R. Soulé, and R. Grimm. Ensuring content
integrity for untrusted peer-to-peer content distribution networks.
In Proceedings of the USENIX Symposium on Networked Systems
Design and Implementation (NSDI), Apr. 2007.

[32] C. Monch, G. Grimen, and R. Midtstraum. Protecting online
games against cheating. In Proceedings of the Workshop on
Network and Systems Support for Games (NetGames), Oct. 2006.

[33] J. Newsome and D. Song. Dynamic taint analysis for automatic
detection, analysis, and signature generation of exploits on
commodity software. In Proceedings of the Annual Network and
Distributed Systems Security Symposium (NDSS), Feb. 2005.

[34] T. Okamoto. A fast signature scheme based on congruential
polynomial operations. IEEE Transactions on Information Theory,
36(1):47-53, 1990.

[35] PunkBuster web site. http://www.evenbalance.com/.

[36] A. Seshadri, M. Luk, E. Shi, A. Perrig, L. van Doorn, and
P. Khosla. Pioneer: Verifying code integrity and enforcing
untampered code execution on legacy systems. In Proceedings of
the ACM Symposium on Operating Systems Principles (SOSP),
Oct. 2005.

[37] A. Smith. ASUS releases games cheat drivers.
http://www.theregister.co.uk/2001/05/10/asus_releases_games_cheat_drivers/,
May 2001.

[38] Valve Corporation. Valve anti-cheat system (VAC).
https://support.steampowered.com/kb-article.php?ref=7849-RADZ-6869.

[39] M. Xu, V. Malyugin, J. Sheldon, G. Venkitachalam, and
B. Weissman. ReTrace: Collecting execution trace with virtual
machine deterministic replay. In Proceedings of the Annual
Workshop on Modeling, Benchmarking, and Simulation (MoBS),
June 2007.

[40] C. Yan, D. Englender, M. Prvulovic, B. Rogers, and Y. Solihin.
Improving cost, performance, and security of memory encryption
and authentication. ACM SIGARCH Computer Architecture
News, 34(2):179-190, 2006.

[41] J. Yan and B. Randell. A systematic classification of cheating in
online games. In Proceedings of the Workshop on Network and
Systems Support for Games (NetGames), Oct. 2005.

[42] S. Yang, A. R. Butt, Y. C. Hu, and S. P. Midkiff. Trust but
verify: Monitoring remotely executing programs for progress
and correctness. In Proceedings of the ACM SIGPLAN Annual
Symposium on Principles and Practice of Parallel Programming
(PPoPP), June 2005.

[43] A. R. Yumerefendi and J. S. Chase. Trust but verify:
Accountability for Internet services. In Proceedings of the ACM SIGOPS
European Workshop, Sep 2004.

[44] A. R. Yumerefendi and J. S. Chase. Strong accountability for
network storage. ACM Transactions on Storage (TOS), 3(3):11,
Oct. 2007.


9th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’10)


USENIX Association 




Bypassing Races in Live Applications with Execution Filters 


Jingyue Wu, Heming Cui, Junfeng Yang 
{jingyue, heming, junfeng}@cs.columbia.edu
Computer Science Department 
Columbia University 
New York, NY 10027 


Abstract 


Deployed multithreaded applications contain many races 
because these applications are difficult to write, test, and 
debug. Worse, the number of races in deployed applica- 
tions may drastically increase due to the rise of multicore 
hardware and the immaturity of current race detectors. 

LOOM is a “live-workaround” system designed to 
quickly and safely bypass application races at runtime. 
LOOM provides a flexible and safe language for develop- 
ers to write execution filters that explicitly synchronize 
code. It then uses an evacuation algorithm to safely in- 
stall the filters to live applications to avoid races. It re- 
duces its performance overhead using hybrid instrumen- 
tation that combines static and dynamic instrumentation. 

We evaluated LOOM on nine real races from a diverse 
set of six applications, including MySQL and Apache. 
Our results show that (1) LOOM can safely fix all evalu- 
ated races in a timely manner, thereby increasing appli- 
cation availability; (2) LOOM incurs little performance 
overhead; (3) LOOM scales well with the number of ap- 
plication threads; and (4) LOOM is easy to use. 


1 Introduction 


Deployed multithreaded applications contain many races 
because these applications are difficult to write, test, and 
debug. These races include data races, atomicity viola- 
tions, and order violations [33]. They can cause applica- 
tion crashes and data corruptions. Worse, the number of 
“deployed races” may drastically increase due to the rise 
of multicore and the immaturity of race detectors. 

Many previous systems can aid race detection (e.g., 
[31, 32, 37, 47, 54]), replay [9, 18, 28, 36, 43], and diag- 
nosis [42, 49]. However, they do not directly address de- 
ployed races. A conventional solution to fixing deployed 
races is software update, but this method requires appli- 
cation restarts, and is at odds with high availability de- 
mand. Live update systems [10, 12, 15, 35, 38, 39, 51] 
can avoid restarts by adapting conventional patches into 
hot patches and applying them to live systems, but the
reliance on conventional patches has two problems.

First, due to the complexity of multithreaded applica- 
tions, race-fix patches can be unsafe and introduce new 
errors [33]. Safety is crucial to encourage user adoption, 
yet automatically ensuring safety is difficult because con- 
ventional patches are created from general, difficult-to- 
analyze languages. Thus, previous work [38, 39] had to 
resort to extensive programmer annotations. 

Second, creating a releasable patch from a correct di- 
agnosis can still take time. This delay leaves buggy ap- 
plications unprotected, compromising reliability and po- 
tentially security. This delay can be quite large: we an- 
alyzed the Bugzilla records of nine real races and found 
that this delay can be days, months, or even years. Ta- 
ble 1 shows the detailed results. 

Many factors contribute to this delay. At a minimum 
level, a conventional patch has to go through code re- 
view, testing, and other mandatory software develop- 
ment steps before being released, and these steps are all 
time-consuming. Moreover, though a race may be fixed 
in many ways (e.g., lock-free flags, fine-grained locks, 
and coarse-grained locks), developers are often forced to 
strive for an efficient option. For instance, two of the 
bugs we analyzed caused long discussions of more than 
30 messages, yet both can be fixed by adding a single 
critical section. Performance pressure is perhaps why 
many races were not fixed by adding locks [33]. 

This paper presents LOOM, a “live-workaround” sys- 
tem designed to quickly protect applications against 
races until correct conventional patches are available and 
the applications can be restarted. It reflects our belief 
that the true power of live update is its ability to pro- 
vide immediate workarounds. To use LOOM, developers 
first compile their application with LOOM. At runtime, 
to workaround a race, an application developer writes an 
execution filter that synchronizes the application source 
to filter out racy thread interleavings. This filter is kept 
separate from the source. Application users can then
download the filter and, for immediate protection, install
it to their application without a restart.


Race ID         Report    Diagnosis  Fix       Release
Apache-25520    12/15/03  12/18/03   01/17/04  03/19/04
Apache-21287    07/02/03  N/A        12/18/03  03/19/04
Apache-46215    11/14/08  N/A        N/A       N/A
MySQL-169       03/19/03  N/A        03/24/03  06/20/03
MySQL-644       06/12/03  N/A        N/A       05/30/04
MySQL-791       07/04/03  07/04/03   07/14/03  07/22/03
Mozilla-73761   03/28/01  03/28/01   04/09/01  05/07/01
Mozilla-201134  04/07/03  04/07/03   04/16/03  01/08/04
Mozilla-133773  03/27/02  03/27/02   12/01/09  01/21/10


Table 1: Long delays in race fixing. We studied the delays
in the fix process of nine real races; some of the races were
extensively studied [9, 31, 33, 42, 43]. We identify each race
by “Application-(Bugzilla #).” Column Report indicates
when the race was reported, Diagnosis when a developer
confirmed the root cause of the race, Fix when the final fix was
posted, and Release when the version of the application
containing the fix was publicly released. We collected all dates by
examining the Bugzilla record of each race. An N/A means
that we could not derive the date. The delay between diagnosis
and fix ranges from a few days to a month to a few years.
For all but two races, the bug reports from the application users
contained correct and precise diagnoses. Mozilla-201134 and
Mozilla-133773 caused long discussions of more than 30
messages, though both can be fixed by adding a critical region.
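
The diagnosis-to-fix delays in Table 1 can be computed directly from the listed dates. The sketch below covers only the rows for which both dates are known, and it assumes that the two-digit years in the table denote 20xx.

```python
from datetime import date

# (race, diagnosis date, fix date), copied from the rows of Table 1
# where both dates are available; two-digit years are read as 20xx.
rows = [
    ("Apache-25520",   date(2003, 12, 18), date(2004, 1, 17)),
    ("MySQL-791",      date(2003, 7, 4),   date(2003, 7, 14)),
    ("Mozilla-73761",  date(2001, 3, 28),  date(2001, 4, 9)),
    ("Mozilla-201134", date(2003, 4, 7),   date(2003, 4, 16)),
    ("Mozilla-133773", date(2002, 3, 27),  date(2009, 12, 1)),
]

# Number of days between diagnosis and fix for each race
delays = {race: (fix - diagnosis).days for race, diagnosis, fix in rows}

for race, days in delays.items():
    print(f"{race:15s} {days:5d} days")
```

The shortest of these delays is a little over a week and the longest exceeds seven years.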

LOOM decouples execution filters from application
source to achieve safety and flexibility. Execution filters
are safe because LOOM’s execution filter language
allows only well-formed synchronization constraints. For
instance, “code regions r1 and r2 are mutually exclusive.”
This declarative language is simpler to analyze
than a general programming language such as C because
LOOM need not reverse-engineer developer intents (e.g.,
what goes into a critical region) from scattered operations
(e.g., lock() and unlock()).

As temporary workarounds, execution filters are more 
flexible than conventional patches. One main benefit is 
that developers can make better performance and reliability
tradeoffs during race fixing. For instance, to make
two code regions r1 and r2 mutually exclusive when they
access the same memory object, developers can use critical
regions larger than necessary; they can make r1 and
r2 always mutually exclusive even when accessing different
objects; or in extreme cases, they can run r1 and r2
in single-threaded mode. This flexibility enables quick
workarounds; it can benefit even the applications that do
not need live update.
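
Two of these options can be contrasted in a short sketch: excluding r1 and r2 only when they touch the same object, versus always excluding them. This is ordinary lock-based code written for illustration, not LOOM's filter language or its runtime; the class and function names are hypothetical.

```python
import threading
from collections import defaultdict


class PerObjectExclusion:
    """Regions exclude each other only when they touch the same object."""
    def __init__(self):
        self._locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()      # protects the lock table itself

    def lock_for(self, obj_id):
        with self._guard:
            return self._locks[obj_id]


class GlobalExclusion:
    """Regions always exclude each other, whatever object they touch."""
    def __init__(self):
        self._lock = threading.Lock()

    def lock_for(self, obj_id):
        return self._lock


def run_region(policy, table, obj_id, delta):
    # Stand-in for code region r1 or r2: a read-modify-write on a shared object.
    with policy.lock_for(obj_id):
        table[obj_id] = table[obj_id] + delta


def demo(policy, num_threads=8, iters=1000):
    table = {"a": 0, "b": 0}

    def worker(obj_id):
        for _ in range(iters):
            run_region(policy, table, obj_id, 1)

    threads = [threading.Thread(target=worker, args=("a" if i % 2 else "b",))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return table


print(demo(PerObjectExclusion()))
print(demo(GlobalExclusion()))
```

Both policies remove the race. The global variant is simpler to write and reason about but serializes accesses to unrelated objects, which is the kind of performance-for-simplicity tradeoff a temporary workaround can afford.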

We believe the execution filter idea and the LOOM 
system as described are worthwhile contributions. To 
the best of our knowledge, LOOM is the first live- 
workaround system designed for races. Our additional 
technical contributions include the techniques we created 
to address the following two challenges. 

A key safety challenge LOOM faces is that even if
an execution filter is safe by construction, installing it 
to a live application can still introduce errors because 
the application state may be inconsistent with the filter. 
For instance, if a thread is running inside a code region 
that an execution filter is trying to protect, a “double- 
unlock” error could occur. Thus, LOOM must (1) check 
for inconsistent states and (2) install the filter only in 
consistent ones. Moreover, LOOM must make the two 
steps atomic, despite the concurrently running applica- 
tion threads and multiple points of updates. This problem 
cannot be solved by a common safety heuristic called 
function quiescence [2, 13, 21, 39]. We thus create a 
new algorithm termed evacuation to solve this problem 
by proactively quiescing an arbitrary set of code regions 
given at runtime. We believe this algorithm can also ben- 
efit other live update systems. 
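
The inconsistent-state problem can be reproduced in a few lines. The sketch below is a single-threaded simulation with hypothetical names. It shows only why a filter must not be installed while a thread is inside the protected region; it is not the evacuation algorithm, which proactively drives threads out of the regions, and it ignores the atomicity requirement discussed above.

```python
import threading


class Region:
    """A code region whose entry/exit synchronization can be added at runtime."""
    def __init__(self):
        self.lock = None      # no filter installed initially
        self.active = 0       # number of threads currently inside the region

    def enter(self):
        self.active += 1
        if self.lock is not None:
            self.lock.acquire()

    def leave(self):
        self.active -= 1
        if self.lock is not None:
            self.lock.release()   # raises RuntimeError if the lock is not held


def install_unchecked(region):
    region.lock = threading.Lock()


def install_if_quiescent(region):
    """Install only when no thread is inside the region; report success."""
    if region.active:
        return False
    region.lock = threading.Lock()
    return True


def unsafe_scenario():
    r = Region()
    r.enter()                 # a thread is already inside the region
    install_unchecked(r)      # filter arrives mid-region
    try:
        r.leave()             # unlock without a matching lock
    except RuntimeError:
        return "double-unlock"
    return "ok"


def safe_scenario():
    r = Region()
    r.enter()
    deferred = not install_if_quiescent(r)   # refused: region is occupied
    r.leave()
    installed = install_if_quiescent(r)      # succeeds once the region is empty
    r.enter()
    r.leave()                                # lock and unlock now match
    return deferred, installed


print(unsafe_scenario())
print(safe_scenario())
```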

A key performance challenge LOOM faces is to main- 
tain negligible performance overhead during an appli- 
cation’s normal operations to encourage adoption. The 
main runtime overhead comes from the engine used to 
live-update an application binary. Although LOOM can 
use general-purpose binary instrumentation tools such as 
Pin, the overhead of these tools (up to 199% [34] and 
1065.39% in our experiments) makes them less suitable 
as options for LOOM. We thus create a hybrid instrumen- 
tation engine to reduce overhead. It statically transforms 
an application to include a “hot backup”, which can then 
be updated arbitrarily by execution filters at runtime. 

We implemented LOOM on Linux. It runs in user 
space and requires no modifications to the applications 
or the OS, simplifying deployment. It does not rely on 
non-portable OS features (e.g., SIGSTOP to pause appli- 
cations, which is not supported properly on Windows). 
LOOM’s static transformation is a plugin to the LLVM 
compiler [3], requiring no changes to the compiler either. 

We evaluated LOOM on nine real races from a diverse 
set of six applications: two server applications, MySQL 
and Apache; one desktop application PBZip2 (a parallel 
compression tool); and implementations of three scien- 
tific algorithms in SPLASH2 [7]. Our results show that 

1. LOOM is effective. It can flexibly and safely fix all
races we have studied. It does not degrade applica- 
tion availability when installing execution filters. Its 
evacuation algorithm can install a fix within a second 
even under heavy workload, whereas a live update 
approach using function quiescence cannot install the 
fix in an hour, the time limit of our experiment. 

2. LOOM is fast. LOOM has negligible performance
overhead and in some cases even speeds up the
applications. The one exception is MySQL. Running
MySQL with LOOM alone increases response time
by 4.11% and degrades throughput by 3.76%.

3. LOOM is scalable. Experiments on a 48-core machine
show that LOOM scales well as the number of
application threads increases.

4. LOOM is easy to use. Execution filters are concise,
safe, and flexible (able to fix all races studied, often
in more than one way).

Figure 1: LOOM overview. Its components are shaded.

This paper is organized as follows. We first give an
overview of LOOM (§2). We then describe LOOM’s
execution filter language (§3), the evacuation algorithm (§4),
and the hybrid instrumentation engine (§5). We then
present our experimental results (§6). We finally discuss
related work (§7) and conclude (§8).


2 Overview 


Figure 1 presents an overview of LOOM. To use LOOM
for live update, developers first statically transform their 
applications with LOOM’s compiler plugin. This plugin 
injects a copy of LOOM’s update engine into the applica- 
tion binary; it also collects the application’s control flow 
graphs (CFG) and symbol information on behalf of the 
live update engine. 

LOOM’s compiler plugin runs within the LLVM
compiler [3]. We choose LLVM for its compatibility with
GCC and its easy-to-analyze intermediate representation 
(IR). However, LOOM’s algorithms are general and can 
be ported to other compilers such as GCC. Indeed, for 
clarity we will present all our algorithms at the source 
level (instead of the LLVM IR level). 

To fix a race, application developers write an execution 
filter in LOOM’s filter language and distribute the filter 
to application users. A user can then install the filter to 
immediately protect their application by running 


% loomctl add <pid> <filter-file> 


Here loomctl is a user-space program called the 
LOOM controller that interacts with users and initiates 
live update sessions, pid denotes the process ID of a 
buggy application instance, and filter-file is a file 
containing the execution filter. Under the hood, this 
controller compiles the execution filter down to a safe update 
plan using the CFGs and symbol information collected 
by the compiler plugin. This update plan includes three 
parts: (1) synchronization operations to enforce the con- 
straints described in the filter and where, in the applica- 
tion, to add the operations; (2) safety preconditions that 
must hold for installing the filter; and (3) sanity checking 
code to detect potential errors in the filter itself. The con- 
troller sends the update plan to the update engine running 
as a thread inside the application’s address space, which 
then monitors the runtime states of the application and 
carries out the update plan only when all the safety pre- 
conditions are satisfied. 

If LOOM detects a problem with a filter through one of 
its sanity checks, it can automatically remove the prob- 
lematic filter. It again waits for all the safety precondi- 
tions to hold before removing the filter. 

Users can also remove a filter manually, if for exam- 
ple, the race that the filter intends to fix turns out to be 
benign. They do so by running 


% loomctl ls <pid> 
% loomctl remove <pid> <filter-id> 


The first command “loomctl ls” returns a list of 
installed filter IDs within process pid. The second 
command “loomctl remove” removes filter 
filter-id from process pid. 

Users can replace an installed filter with a new filter, 
if for example the new filter fixes the same race but has 
less performance overhead. Users do so by running 


% loomctl replace <pid> <old-id> <new-file> 

where old-id is the ID of the installed filter, and 
new-file is a file containing the new filter. LOOM 
ensures that the removal of the old filter and the installa- 
tion of the new filter are atomic, so that the application is 
always protected from the given race. 





2.1 Usage Scenarios 


LOOM enables users to explicitly describe their synchro- 
nization intents and orchestrate thread interleavings of 
live applications accordingly. Using this mechanism, we 
envision a variety of strategies users can use to fix races. 

Live update At the most basic level, users can translate 
some conventional patches into execution filters, and use 
LOOM to install them to live applications. 

Temporary workaround Before a permanent fix (i.e., a 
correct source patch) is out, users can create an execu- 
tion filter as a crude, temporary fix to a race, to provide 
immediate protection to highly critical applications. 

Preventive fix When a potential race is reported (e.g., 
by automated race detection tools or users of the appli- 
cation), users can immediately install a filter to prevent 
the race suspect. Later, when developers deem this report 
false or benign, users can simply remove the filter. 

Cooperative fix Users can share filters with each other. 
This strategy enjoys the same benefits as other coopera- 
tive protection schemes [17, 26, 44, 50]. One advantage 
of LOOM over some of these systems is that it automat- 
ically verifies filter safety, thus potentially reducing the 
need to trust other users. 

Site-specific fix Different sites have different workloads. 
An execution filter too expensive for one site may be fine 
for another. The flexibility of execution filters allows 
each site to choose what specific filters to install. 

Fix without live update For applications that do not 
need live update, users can still use LOOM to create quick 
workarounds, improving reliability. 

Besides fixing races, LOOM can be used for the op- 
posite: demonstrating a race by forcing a racy thread in- 
terleaving. Compared to previous race diagnosis tools 
that handle a fixed set of race patterns [25, 41, 42, 49], 
LOOM’s advantage is to allow developers to construct 
potentially complex “concurrency” testcases. 

Although LOOM can also avoid deadlocks by avoiding 
deadlock-inducing thread interleavings, it is less suitable 
for this purpose than existing tools (e.g., Dimmunix [26]). 
To avoid races, LOOM’s update engine can add 
synchronizations to arbitrary program locations. This 
engine is overkill for avoiding deadlocks: intercepting lock 
operations (e.g., via LD_PRELOAD) is often enough. 


2.2 Limitations 


LOOM is explicitly designed to work around (broadly de- 
fined) races because they are some of the most difficult 
bugs to fix and this focus simplifies LOOM’s execution 
filter language and safety analysis. LOOM is not intended 
for other classes of errors. Nonetheless, we believe the 
idea of high-level and easy-to-verify fixes can be gener- 
alized to many other classes of errors. 

LOOM does not attempt to repair the effects of races that 
have already occurred. That is, if a race has caused bad 
effects (e.g., corrupted data), LOOM does not attempt to 
reverse the effects (e.g., recover the data). It is 
conceivable to allow developers to provide a general 
function that LOOM runs to recover from such races before 
installing a filter. Although this feature is simple to 
implement, it makes safety analysis infeasible. We thus 
rejected this feature. 

Safety in LOOM terms means that an execution filter 
and its installation/removal processes introduce no new 
correctness errors to the application. However, similar 
to other safe error recovery [46] or avoidance [26, 52] 
tools, LOOM runs with the application and perturbs tim- 
ing, thus it may expose some existing application races 
because it makes some thread interleavings more likely 
to occur. Moreover, execution filters synchronize code, 
and may introduce deadlocks and performance problems. 
LOOM can recover from filter-introduced deadlocks 
(§3.3) using timeouts, but currently does not deal 
with performance problems. 

9th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’10) 

At an implementation level, LOOM currently supports 
a fixed set of synchronization constraint types. Although 
adding new types of constraints is easy, we have found 
the existing constraint types sufficient to fix all races 
evaluated. Another issue is that LOOM uses debugging 
symbol information in its analysis, which can be inaccu- 
rate due to compiler optimization. This inaccuracy has 
not been a problem for the races in our evaluation be- 
cause LOOM keeps an unoptimized version of each basic 
block for live update (§5). 


3 Execution Filter Language 


LOOM’s execution filter language allows developers to 
explicitly declare their synchronization intents on code. 
This declarative approach has several benefits. First, 
it frees developers from the low-level details of syn- 
chronization, increasing race fixing productivity. Sec- 
ond, it also simplifies LOOM’s safety analysis because 
LOOM does not have to reverse-engineer developer intents 
(e.g., what goes into a critical section) from low-level 
synchronization operations (e.g., scattered lock() 
and unlock()), which can be difficult and error-prone. 
Lastly, LOOM can easily insert error-checking code for 
safety when it compiles a filter down to low-level syn- 
chronization operations. 


3.1 Example Races and Execution Filters 


In this section, we present two real races and the execu- 
tion filters to fix them to demonstrate LOOM’s execution 
filter language and its flexibility. 

The first race is in MySQL (Bugzilla # 791), which 
causes the MySQL on-disk transaction log to miss 
records. Figure 2 shows the race. The code on the 
left (function new_file()) rotates MySQL’s transaction 
log file by closing the current log file and opening 
a new one; it is called when the transaction log has to 
be flushed. The code on the right is used by MySQL to 
append a record to the transaction log. It uses 
double-checked locking and writes to the log only when the log 
is open. The race occurs if the racy is_open() (T2, 
line 3) catches a closed log when thread T1 is between 
the close() (T1, line 5) and the open() (T1, line 6). 

Although a straightforward fix to the race exists, per- 
formance demands likely forced developers to give up 
the fix and choose a more complex one instead. The 
straightforward fix should just remove the racy check 
(T2, line 3). Unfortunately, this fix creates unneces- 
sary overhead if MySQL is configured to skip logging 
for speed; this overhead can increase MySQL’s response 
time by more than 10% as observed in our experiments. 
Concern about this overhead likely forced MySQL 
developers to use a more involved fix, which adds a new 
flag field to MySQL’s transaction log and modifies the 
close() function to distinguish a regular close() 
call from one that reopens the log. 


1: // log.cc. thread T1              1: // sql_insert.cc. thread T2 
2: void MYSQL_LOG::new_file(){       2: // [race] may return false 
3:   lock(&LOCK_log);                3: if (mysql_bin_log.is_open()){ 
4:                                   4:   lock(&LOCK_log); 
5:   close(); // log is closed       5:   if (mysql_bin_log.is_open()){ 
6:   open(...);                      6:     ... // write to log 
7:                                   7:   } 
8:   unlock(&LOCK_log);              8:   unlock(&LOCK_log); 
9: }                                 9: } 

Figure 2: A real MySQL race, slightly modified for clarity. 


// Execution filter 1: unilateral exclusion 
{log.cc:5, log.cc:6} <> * 

// Execution filter 2: mutual exclusion of code 
{log.cc:5, log.cc:6} <> MYSQL_LOG::is_open 

// Execution filter 3: mutual exclusion of code and data 
{log.cc:5 (this), log.cc:6 (this)} <> MYSQL_LOG::is_open(this) 

Figure 3: Execution filters for the MySQL race in Figure 2. 

In contrast, LOOM allows developers to create tem- 
porary workarounds with flexible performance and reli- 
ability tradeoffs. These temporary fixes can protect the 
application until developers create a correct and efficient 
fix at the source level. Figure 3 shows several execution 
filters that can fix this race. Execution filter 1 in the 
figure is the most conservative fix: it makes the code 
region between T1, line 5 and T1, line 6 atomic against 
all code regions, so that when a thread executes this 
region, all other threads must pause. We call such a 
synchronization constraint unilateral exclusion, in contrast to 
mutual exclusion, which requires that participating threads 
agree on the same lock.¹ Here operator “<>” expresses mutual 
exclusion constraints, its first operand “{log.cc:5, 
log.cc:6}” specifies a code region to protect, and its 
second operand “*” represents all code regions. This 
“expensive” fix incurs only 0.48% overhead (§6.1) 
because the log rotation code rarely executes. 

Execution filter 2 reduces overhead by refining 
the “*” operand to a specific code region, function 
MYSQL_LOG::is_open(). This filter makes the two 
code regions mutually exclusive, regardless of what 
memory locations they access. Execution filter 3 further 
improves performance by specifying the memory loca- 
tion accessed by each code region. 

The second race causes PBZip2 to crash due to a use- 
after-free error. Figure 4 shows the race. The crash oc- 
curs when fifo is dereferenced (line 10) after it is freed 
(line 5). The reason is that the main() thread does not 
wait for the decompress() threads to finish. To fix 
this race, developers can use the filter in Figure 5, which 
constrains line 10 to run for numCPU times before line 5. 


¹Note that unilateral exclusion differs (subtly) from single-threaded 
execution: unilateral exclusion allows no context switches. 

// pbzip2.cpp. thread T1             // pbzip2.cpp. thread T2 
1: main() {                          7: void *decompress(void *q){ 
2:   for(i=0;i<numCPU;i++)           8:   queue *fifo = (queue *)q; 
3:     pthread_create(...,           9:   ... 
4:         decompress, fifo);       10:   pthread_mutex_lock(fifo->mut); 
5:   queueDelete(fifo);             11:   ... 
6: }                                12: } 

Figure 4: A real PBZip2 race, simplified for clarity. 


pbzip2.cpp:10 {numCPU} > pbzip2.cpp:5 
Figure 5: Execution filter for the PBZip2 race in Figure 4. 


3.2 Syntax and Semantics 


Table 2 summarizes the main syntax and semantics of 
LOOM’s execution filter language. This language 
allows developers to express synchronization constraints 
on events and regions. An event in the simplest form is 
“file : line,” which represents a dynamic instance of 
a static program statement, identified by file name and 
line number. An event can have an additional “(expr)” 
component and an “{n}” component, where expr and 
n refer to valid expressions with no function calls or 
dereferences. The expr expression distinguishes differ- 
ent dynamic instances of program statements and LOOM 
synchronizes the events only with matching expr values. 
The n expression specifies the number of occurrences of 
an event and is used in execution order constraints. A 
region represents a dynamic instance of a static code 
region, identified by a set of entry and exit events or an 
application function. A region representing a function 
call can have an additional “(args)” component to dis- 
tinguish different calls to the same function. 

LOOM currently supports three types of synchroniza- 
tion constraints (the bottom three rows in Table 2). Al- 
though adding new constraint types is easy, we have 
found existing ones enough to fix all races evaluated. An 
execution order constraint as shown in the table makes 
event e_1 happen before e_2, e_2 before e_3, and so forth. A 
mutual exclusion constraint as shown makes every pair 
of code regions r_i and r_j mutually exclusive with each 
other. A unilateral exclusion constraint conceptually 
makes the execution of a code region single-threaded. 


3.3 Language Implementation 


LOOM implements the execution filter language using 
locks and semaphores. Given an execution order 
constraint e_i > e_{i+1}, LOOM inserts a semaphore up() 
operation at e_i and a down() operation at e_{i+1}. LOOM 
implements a mutual exclusion constraint by inserting 
lock() at region entries and unlock() at region exits. 
LOOM implements a unilateral exclusion constraint by 
reusing the evacuation mechanism (§4), which can pause 
threads at safe locations and resume them later. 

Constructs             Syntax 
Event (short as e)     file : line 
                       file : line (expr) 
                       e{n}, n is # of occurrences 
Region (short as r)    {...}, a set of entry and exit events 
                       func (args) 
Execution Order        e_1 > e_2 > ... > e_n 
Mutual Exclusion       r_1 <> r_2 <> ... <> r_n 
Unilateral Exclusion   r <> * 

Table 2: Execution filter language summary. 


LOOM creates the needed locks and semaphores on 
demand. The first time a lock or semaphore is refer- 
enced by one of the inserted synchronization operations, 
LOOM creates this synchronization object based on the 
ID of the filter, the ID of the constraint, and the value of 
expr if present. It initializes a lock to an unlocked state 
and a semaphore to 0. It then inserts this object into a 
hash table for future references. To limit the size of this 
table, LOOM garbage-collects these synchronization ob- 
jects. Freeing a synchronization object is safe as long 
as it is unlocked (for locks) or has a counter of 0 (for 
semaphores). If this object is referenced later, LOOM 
simply re-creates it. The default size of this table is 256 
and LOOM never needed to garbage-collect synchroniza- 
tion objects in our experiments. 
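
The on-demand creation described above can be illustrated with a small sketch. This is not LOOM's code: the names sync_key, sync_obj, sync_get, and sync_gc are hypothetical, the hash function is arbitrary, and real locks and semaphores are replaced by a plain integer state. Only the keying (filter ID, constraint ID, expr value), the initial states, the rule for safe freeing, and the table size of 256 come from the text.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TABLE_SIZE 256  /* default table size reported in the text */

/* Hypothetical key: filter ID, constraint ID, and the value of expr. */
struct sync_key { uint32_t filter_id, constraint_id; uintptr_t expr_val; };

struct sync_obj {
    struct sync_key key;
    int is_semaphore;       /* 0: lock, 1: semaphore */
    int state;              /* lock: 0 = unlocked; semaphore: counter */
    struct sync_obj *next;  /* bucket chain */
};

static struct sync_obj *table[TABLE_SIZE];

static unsigned hash_key(const struct sync_key *k) {
    uintptr_t h = k->filter_id * 31u + k->constraint_id;
    h = h * 31u + k->expr_val;
    return (unsigned)(h % TABLE_SIZE);
}

static int key_eq(const struct sync_key *a, const struct sync_key *b) {
    return a->filter_id == b->filter_id &&
           a->constraint_id == b->constraint_id &&
           a->expr_val == b->expr_val;
}

/* Return the object for key, creating it on first reference. */
struct sync_obj *sync_get(struct sync_key key, int is_semaphore) {
    unsigned b = hash_key(&key);
    for (struct sync_obj *o = table[b]; o; o = o->next)
        if (key_eq(&o->key, &key))
            return o;
    struct sync_obj *o = calloc(1, sizeof *o);
    if (!o)
        return NULL;
    o->key = key;
    o->is_semaphore = is_semaphore;
    o->state = 0;           /* unlocked lock, or semaphore counter 0 */
    o->next = table[b];
    table[b] = o;
    return o;
}

/* Free only idle objects (unlocked, or counter 0); return how many. */
int sync_gc(void) {
    int freed = 0;
    for (unsigned b = 0; b < TABLE_SIZE; b++) {
        struct sync_obj **pp = &table[b];
        while (*pp) {
            struct sync_obj *o = *pp;
            if (o->state == 0) { *pp = o->next; free(o); freed++; }
            else pp = &o->next;
        }
    }
    return freed;
}
```

Two references with the same key yield the same object, while a different expr value yields a distinct one, which is how events with different expr values end up synchronized separately. An object freed by sync_gc() is simply re-created, in its initial state, by the next sync_get().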

The up() and down() operations LOOM inserts behave 
slightly differently from standard semaphore operations 
when n, the number of occurrences, is specified. 
Given e_1{n_1} > e_2{n_2}, up() conceptually increases 
the semaphore counter by 1/n_1 and down() decreases 
it by 1/n_2. Our implementation uses integers instead of 
floats. LOOM stores the value of n the first time the 
corresponding event runs and ignores future changes of n. 
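
The text does not say how the integer version is encoded. One possible encoding, shown here purely as an assumption, scales the counter by n1*n2 so that a conceptual step of 1/n1 becomes n2 units and a step of 1/n2 becomes n1 units. The names order_sem, order_up, and order_try_down are hypothetical, and blocking and thread wake-up are left out: the sketch only reports whether a down() could proceed.

```c
#include <assert.h>

/* Hypothetical integer encoding for the constraint e1{n1} > e2{n2}.
 * The counter is scaled by n1*n2: 1/n1 -> n2 units, 1/n2 -> n1 units. */
struct order_sem {
    long n1, n2;    /* occurrence counts, captured on first use */
    long counter;   /* scaled counter, initially 0 */
};

void order_sem_init(struct order_sem *s, long n1, long n2) {
    s->n1 = n1;
    s->n2 = n2;
    s->counter = 0;
}

/* up(): runs at e1 and adds the scaled equivalent of 1/n1. */
void order_up(struct order_sem *s) {
    s->counter += s->n2;
}

/* Non-blocking down(): runs at e2. Returns 1 and subtracts the scaled
 * equivalent of 1/n2 if e2 may proceed; returns 0 if it would wait. */
int order_try_down(struct order_sem *s) {
    if (s->counter < s->n1)
        return 0;
    s->counter -= s->n1;
    return 1;
}
```

For a constraint shaped like the one in Figure 5, with n1 = 4 standing in for numCPU and n2 = 1, order_try_down() keeps returning 0 until order_up() has run four times.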

LOOM computes the values of expr and n using de- 
bugging symbol information. We currently allow expr 
and n to be the following expressions: a (constant or 
primitive variable), a+b, &a, &a[i], &a->f, or any 
recursive combinations of these expressions. For safety, 
we do not allow function calls or dereferences. These 
expressions are sufficient for writing the execution filters 
in our evaluation. 

We implemented this feature using the DWARF li- 
brary and the parse_exp_1() function in GDB. 
Specifically, we use parse_exp_1() to parse the expr 
or n component into an expression tree, then compile this 
tree into low level instructions by querying the DWARF 
library. Note this compilation step is done inside the 
LOOM controller, so that the live update engine does not 
have to pay this overhead. 

LOOM implements three mechanisms for safety. First, 
by keying synchronization objects based on filter and 
constraint IDs, it uses a disjoint set of synchronization 
objects for different execution filters and constraints, 
avoiding interference among them. Second, LOOM 
inserts additional checking code when it generates the 
update plan. For example, given a code region c in a 
mutual exclusion constraint, LOOM checks for errors such 
as c’s unlock() releasing a lock not acquired by c’s 
lock(). Lastly, LOOM checks for filter-induced 
deadlocks to guard against buggy filters. If a buggy filter 
introduces a deadlock, one of its synchronization 
operations must be involved in the wait cycle. LOOM 
detects such deadlocks using timeouts, and automatically 
removes the offending filter. 

Figure 6: Unsafe program states for installing filters. 


4 Avoiding Unsafe Application States 


Figure 6 shows three unsafe scenarios LOOM must han- 
dle. For a mutual exclusion constraint that turns code 
regions into critical sections, LOOM must ensure that 
no thread is executing within the code regions when 
installing the filter to avoid “double-unlock” errors. 
Similarly, for an execution order constraint e_1 > e_2, LOOM 
must ensure either of the following two conditions when 
installing the filter: (1) both e_1 and e_2 have occurred or 
(2) neither has occurred; otherwise the up() LOOM 
inserts at e_1 may get skipped or wake up a wrong thread. 

Note that a naive approach is to simply ignore an 
unlock() if the corresponding lock is already un- 
locked, but this approach does not work with execution 
order constraints. Moreover, it mixes unsafe program 
states with buggy filters, and may reject correct filters 
simply because it tries to install the filters at unsafe pro- 
gram states. 

A common safety heuristic called function quies- 
cence [2, 13, 21, 39] cannot address this unsafe state 
problem. This technique updates a function only when 
no stack frame of this function is active in any call stack 
of the application. Unfortunately, though this technique 
can ensure safety for many live updates, it is insufficient 
for execution filters because their synchronization con- 
straints may affect multiple functions. 

We demonstrate this point using a race example. 
Figure 7 shows the worker thread code of a contrived 
database. Function handle_client() is the main 
thread function. It takes a client socket as input and 
repeatedly processes requests from the socket. For each 
request, function handle_client() opens the 
corresponding database table by calling open_table(), 
serves the request, and closes the table by calling 
close_table(). The race in Figure 7 occurs when 
multiple clients concurrently access the same table. 


 1: // database worker thread 
 2: void handle_client(int fd) { 
 3:   for(;;) { 
 4:     struct client_req req; 
 5:     int ret = recv(fd, &req, ...); 
 6:     if(ret <= 0) break; 
 7:     open_table(req.table_id); 
 8:     ... // do real work 
 9:     close_table(req.table_id); 
10:   } 
11: } 
12: void open_table(int table_id) { 
13:   // fix: acquire table lock 
14:   ... // actual code to open table 
15: } 
16: void close_table(int table_id) { 
17:   ... // actual code to close table 
18:   // fix: release table lock 
19: } 

Figure 7: A contrived race. 

To fix this race, an execution filter can add a lock ac- 
quisition at line 13 in open_table() and a lock re- 
lease at line 18 in close_table(). To safely install 
this filter, however, the quiescence of open_table() 
and close_table() is not enough, because a thread 
may still be running at line 8 and cause a double-unlock 
error. An alternative fix is to add the lock acquisition and 
release in function handle_client(), but this 
function rarely quiesces because of the busy loop (lines 3-10) 
and the blocking call recv(). 

LOOM solves the unsafe state problem using an 
algorithm termed evacuation that can proactively quiesce 
arbitrary code regions. At a high level, this algorithm 
takes a filter and computes a set of unsafe program 
locations that may interfere with the filter. It does so 
conservatively to avoid marking an unsafe location as safe. 
Then, it “evacuates” threads out of the unsafe locations 
and blocks them at safe program locations. After that, it 
installs the filter and resumes the threads. 


4.1 Computing Unsafe Program Locations 


LOOM uses slightly different methods to compute the 
unsafe program locations for mutual exclusion and for 
execution order constraints. To compute unsafe program 
locations for mutual exclusion constraints, LOOM 
performs a static reachability analysis on the 
interprocedural control flow graph (ICFG) of an application. An 
ICFG connects each function’s control flow graphs by 
following function calls and returns. Figure 8a shows 
the ICFG for the code in Figure 7. We say statement s_1 
reaches s_2, or reachable(s_1, s_2), if there is a path from s_1 
to s_2 on the ICFG. For example, the statement at line 13 
reaches the statement at line 8 in Figure 7. 

Given an execution filter f with mutual exclusion 
constraint r_1 <> r_2 <> ... <> r_n, LOOM includes 
any statement s potentially inside one of the regions 
in unsafe(f), the set of unsafe program locations 
for filter f. Specifically, unsafe(f) is the set 
of statements s such that reachable(r_i.entries, s) ∧ 
reachable(s, r_i.exits) for some i ∈ [1, n], where r_i.entries 
are the entry statements to region r_i and r_i.exits are the 
exit statements. 

Figure 9: Evacuation. Curved lines represent application 
threads, solid triangles (in black) represent the threads’ 
program counters (PC), and solid stripes (in red) represent 
an unsafe code region. 
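
The definition of unsafe(f) can be illustrated with a small reachability sketch. The adjacency-matrix graph, the single entry and single exit per region, and the names mark_reachable and compute_unsafe are simplifying assumptions made for this example; they do not describe how LOOM represents the ICFG. A node is treated as reaching itself, so region entries and exits count as unsafe.

```c
#include <assert.h>

#define MAX_NODES 16

/* Mark every node reachable from `start`. With forward = 1 edges are
 * followed u -> v; with forward = 0 they are followed backwards, which
 * yields the nodes that can reach `start`. adj[u][v] != 0 is an edge. */
static void mark_reachable(int adj[MAX_NODES][MAX_NODES], int n,
                           int start, int forward, int seen[MAX_NODES]) {
    int stack[MAX_NODES], top = 0;
    if (seen[start])
        return;
    seen[start] = 1;
    stack[top++] = start;
    while (top > 0) {
        int u = stack[--top];
        for (int v = 0; v < n; v++) {
            int edge = forward ? adj[u][v] : adj[v][u];
            if (edge && !seen[v]) {
                seen[v] = 1;        /* each node is pushed at most once */
                stack[top++] = v;
            }
        }
    }
}

/* unsafe[s] = 1 iff reachable(entry, s) and reachable(s, exit_node). */
void compute_unsafe(int adj[MAX_NODES][MAX_NODES], int n,
                    int entry, int exit_node, int unsafe[MAX_NODES]) {
    int fwd[MAX_NODES] = {0}, bwd[MAX_NODES] = {0};
    mark_reachable(adj, n, entry, 1, fwd);
    mark_reachable(adj, n, exit_node, 0, bwd);
    for (int s = 0; s < n; s++)
        unsafe[s] = fwd[s] && bwd[s];
}
```

On a graph with the edges 0->1->2->3->4 and 1->5->4, a region entered at node 1 and exited at node 3 makes nodes 1, 2, and 3 unsafe. Node 5 is reachable from the entry but cannot reach the exit, and node 0 can reach the exit but is not reachable from the entry, so both are safe.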

LOOM computes unsafe program locations for an 
execution order constraint by first deriving code regions 
from the constraint, then reusing the method for mutual 
exclusion to compute unsafe program locations. 
Specifically, given e_1 > e_2 > ... > e_n, LOOM first computes 
a dominator statement s_d such that s_d dominates all e_i 
(i.e., s_d is on every path from the program start to e_i); it 
then computes unsafe(f) as the set of statements inside 
each {s_d; e_i} region. 
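
The dominance condition can be checked directly from its definition: s_d dominates e_i exactly when e_i cannot be reached from the program start once s_d is removed from the graph. The sketch below applies that test on an adjacency matrix. It is a deliberately naive illustration with hypothetical names (reaches_avoiding, dominates, dominates_all); the text does not say which dominator algorithm LOOM uses or how it picks among several common dominators.

```c
#include <assert.h>

#define MAX_NODES 16

/* Return 1 if `target` is reachable from `start` on a path that never
 * visits `removed`. adj[u][v] != 0 means there is an edge u -> v. */
static int reaches_avoiding(int adj[MAX_NODES][MAX_NODES], int n,
                            int start, int target, int removed) {
    int seen[MAX_NODES] = {0}, stack[MAX_NODES], top = 0;
    if (start == removed)
        return 0;
    seen[start] = 1;
    stack[top++] = start;
    while (top > 0) {
        int u = stack[--top];
        if (u == target)
            return 1;
        for (int v = 0; v < n; v++)
            if (adj[u][v] && v != removed && !seen[v]) {
                seen[v] = 1;
                stack[top++] = v;
            }
    }
    return 0;
}

/* d dominates e if every path from start to e passes through d. */
int dominates(int adj[MAX_NODES][MAX_NODES], int n,
              int start, int d, int e) {
    return d == e || !reaches_avoiding(adj, n, start, e, d);
}

/* Return 1 if d dominates every event in events[0..count-1]. */
int dominates_all(int adj[MAX_NODES][MAX_NODES], int n, int start,
                  int d, const int *events, int count) {
    for (int i = 0; i < count; i++)
        if (!dominates(adj, n, start, d, events[i]))
            return 0;
    return 1;
}
```

In a diamond-shaped graph 0->1, 1->2, 1->3, 2->4, 3->4 with start node 0, node 1 dominates both branch nodes 2 and 3 and is therefore a valid s_d for events on the two branches, whereas node 2 is not, because node 3 is reachable without passing through it.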

Since the e_i may be in different threads, LOOM 
augments the ICFG of an application into a thread 
interprocedural control flow graph (TICFG) by adding edges 
for thread creation and thread join statements. 
Currently our analysis constructs the TICFG by treating 
each pthread_create(func) statement as a function 
call to func(): it adds an edge from the statement 
to the entry of func() and a thread join edge from the 
exit of func() to the statement. 


4.2 Controlling Application Threads 


LOOM needs to control application threads to pause and 
resume them. It does so using a read-write lock called the 
update lock. To live update an application, LOOM grabs 
this lock in write mode, performs the update, and releases 
this lock. To control application threads, LOOM’s com- 
piler plugin instruments the application so that the ap- 
plication threads hold this lock in read mode in normal 
operation and check for an update once in a while by re- 
leasing and re-grabbing this lock. 

LOOM carefully places update-checks inside an appli- 
cation to reduce the overhead and ensure a timely update. 
Figure 8b shows the placement of these update-checks. 
LOOM needs no update-checks inside straight-line code 
with no blocking calls because such code can complete 
quickly. LOOM places one update-check for each cycle 
in the control flow graph, including loops and recursive 
function call chains, so that an application thread cycling 
in one of these cycles can check for an update at least 
once each iteration. Currently LOOM instruments the 
backedge of a loop and an arbitrary function entry in a 
recursive function cycle. LOOM does not instrument 
every function entry because doing so is costly. 

Figure 8: Static transformations that LOOM does for safe and fast live update. Subfigure (a) shows the ICFG of the 
code in Figure 7; (b) shows the resulting CFG of function handle_client() after the instrumentation to control 
application threads (§4); (c) shows the final CFG of function handle_client() after basic block cloning (§5). 

LOOM also instruments an application to release the 
update lock before a blocking call and re-grab it 
after the call, so that an application thread blocking on 
the call does not delay an update. For the example in 
Figure 7, LOOM can perform the update despite some 
threads blocking in recv(). LOOM instruments only 
the “leaf-level” blocking calls. That is, if foo() calls 
bar() and bar() is blocking, LOOM instruments the 
calls to bar(), but not the calls to foo(). Currently 
LOOM conservatively considers calls to external 
functions (i.e., functions without source), except math library 
functions, as blocking to save user annotation effort. 


4.3 Pausing at Safe Program Locations 


Besides the update lock, LOOM uses additional syn- 
chronization variables to ensure that application threads 
pause at safe locations. LOOM assigns a wait flag for 
each backedge of a loop and the chosen function entry 
of a recursive call cycle. To enable/disable pausing at a 
safe/unsafe location, LOOM sets/clears the correspond- 
ing flag. The instrumentation code for each CFG cycle 
(left of Figure 10) checks for an update only when the 
corresponding wait flag is set. These wait flags allow ap- 
plication threads at unsafe program locations to run until 
they reach safe program locations, effectively evacuating 
the unsafe program locations. 

// inserted at CFG cycle          // inserted before blocking call 
void cycle_check() {              void before_blocking() { 
  if(wait[stmt_id]) {               atomic_inc(&counter[callsite_id]); 
    read_unlock(&update);           read_unlock(&update); 
    while(wait[stmt_id]);         } 
    read_lock(&update);           // inserted after blocking call 
  }                               void after_blocking() { 
}                                   read_lock(&update); 
                                    atomic_dec(&counter[callsite_id]); 
                                  } 

Figure 10: Instrumentation to pause application threads. 


Note that the statement “if(wait[stmt_id])” in 
Figure 10 greatly improves LOOM’s performance. With 
this statement, application threads need not always 
release and re-grab the update lock, which can be costly, 
and hardware caches and branch prediction can effectively 
hide the overhead of checking these flags. This technique 
speeds up LOOM significantly (§6) because the wait flags are 
almost always 0 and are only read. 

LOOM cannot use the wait-flag technique to skip a 
blocking function call because doing so changes the ap- 
plication semantics. Instead, LOOM assigns a counter to 
each blocking callsite to track how many threads are at 
the callsites (right of Figure 10). LOOM uses a counter 
instead of a binary flag because multiple threads may be 
doing the same call. 

Now that LOOM’s instrumentation is in place, 
Figure 11 shows LOOM’s evacuation method, which runs 
within LOOM’s live update engine. This method first sets 
the wait flags for safe backedges. It then grabs the update 
lock in write mode, which pauses all application threads. 
It then examines the counters of unsafe callsites and if 
any counter is positive, it releases the update lock and 
retries, so that the threads blocked at unsafe callsites can 
wake up and advance to safe locations. Next, it updates 
the application (§5), clears the wait flags, and releases 
the update lock. 


volatile int wait[NBACKEDGE] = {0}; 
volatile int counter[NCALLSITE] = {0}; 
rwlock_t update; 
void evacuate() { 
  for each B in safe backedges 
    wait[B] = 1; // turn on wait flags 
retry: 
  write_lock(&update); // pause app threads 
  for each C in unsafe callsites 
    if(counter[C]) { // threads paused at unsafe callsites 
      write_unlock(&update); 
      goto retry; 
    } 
  ... // update 
  for each B in safe backedges 
    wait[B] = 0; // turn off wait flags 
  write_unlock(&update); // resume app threads 
} 

Figure 11: Pseudo code of the evacuation algorithm. 


4.4 Correctness Discussion 


We briefly discuss the correctness of our evacuation al- 
gorithm in this subsection; for a complete proof, please 
refer to our technical report [53]. 

In program analysis terms, our reachability analysis 
(§4.1) is interprocedural and flow-sensitive. We use 
a crude pointer analysis to discover thread functions, 
thread join sites, and function pointer targets. We could 
have refined our analysis to improve precision, but we 
find it sufficient to compute unsafe locations for all eval- 
uated races because (1) our analysis is sound and never 
marks an unsafe location safe and (2) execution filters 
are quite small and slight imprecision does not matter. In 
the worst case, if our analysis turns out too imprecise for 
some filters, the flexibility of LOOM allows developers 
to easily adjust their filters to pass the safety analysis. 

Server programs frequently use thread pools, creat- 
ing problems for our reachability analysis. Specifically, 
these servers tend to create a fixed set of threads dur- 
ing initialization, then reuse them for independent re- 
quests. If we compute dominators using the creation sites 
of these threads, we would find that dominators only run 
during server initialization. Fortunately, we can anno- 
tate the reuse of a thread as a special thread creation site, 
so that our algorithm computes correct dominators. In 
our experiments, we did not (and need not) annotate any 
thread reuse. 

Our reachability analysis gives correct results de- 
spite compiler reordering. In order to pause application 
threads at safe locations, our reachability analysis returns 
only the set of unsafe backedges and external callsites. 
These locations are instrumented by LOOM; this 
instrumentation acts as a barrier and prevents compilers from 
reordering instructions across them. 


void slot(int stmt_id) { 
  op_list = operations[stmt_id]; 
  foreach op in op_list 
    do op; 
} 

Figure 12: Slot function. 

The synchronization between the instrumentation in
Figure 10 and the evacuation algorithm in Figure 11 is
correct under two conditions: (1) reads and writes to wait
flags are atomic and (2) the operations on the update lock
contain correct memory barriers that prevent hardware
reordering. Currently we implement wait flags using
aligned integers; our update lock operations use atomic
operations similar to the Linux kernel’s rw-spinlock.
Thus, our evacuation algorithm works correctly on x86
and AMD64, which do not reorder instructions across
atomic instructions. We expect our algorithm to work on
other commodity hardware that also provides this guarantee.
To cope with more relaxed hardware (e.g., Alpha),
we can augment these operations with full barriers.


5 Hybrid Instrumentation 


Most previous live update systems update binaries by 
compiling updated functions and redirecting old func- 
tions to the new function binaries using a table or jump 
instructions. This approach requires source patches to 
generate the updates, thus it has the limitations described 
in §1. Moreover, this approach pays the overhead of
position independent code (PIC) because application
functions must be compiled as PIC for live update. It also
suffers from the aforementioned function quiescence problem.²

Another alternative is to use general-purpose binary
instrumentation tools such as vx32 [20], Pin [34] and
DynamoRIO [14], but they tend to incur significant runtime
overhead just to run their frameworks alone. For example,
Pin has been reported to incur 199% overhead [34],
and we observed a 10-times slowdown on Apache with a
CPU-bound workload (§6).

LOOM’s hybrid instrumentation engine reduces runtime
overhead by combining static and dynamic instrumentation.
This engine statically transforms an application’s
binary to anticipate dynamic updates. The static
transformation pre-pads, before each program location,
a slot function which interprets the updates to this
program location at runtime. Figure 12 shows the pseudocode
of this function. It iterates through a list of
synchronization operations assigned to the current statement and
performs each. To update a program location at runtime,
LOOM simply modifies the corresponding operation list.
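
The slot mechanism described above can be sketched in a few lines. This is a minimal illustration in Python, not LOOM's implementation: the operations table, the install helper, and the statement id 7 are hypothetical names chosen for the example.

```python
from collections import defaultdict

# Hypothetical operation table: statement id -> list of callables.
operations = defaultdict(list)

def slot(stmt_id):
    """Run every operation currently assigned to this program location."""
    for op in list(operations[stmt_id]):  # iterate over a copy of the list
        op()

def install(stmt_id, op):
    """A 'live update' only edits the operation list; program code is untouched."""
    operations[stmt_id].append(op)

log = []

def program():
    slot(7)                 # pre-padded slot before the statement with id 7
    log.append("stmt 7")    # the original statement

program()                                 # no operations installed yet
install(7, lambda: log.append("lock"))    # update the location at runtime
program()                                 # the slot now performs the new operation
```

After the second call, log holds the original statement, then the installed operation followed by the statement again, which shows that behavior changed without touching program().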

Inserting the slot function at every statement incurs 


²The function quiescence problem can be addressed by transforming
loop bodies into functions [38, 39] but only if the CFGs are
reducible [23].


9th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’10)







Race ID          Description

MySQL-791        Calls to close() and open() to flush log file are
                 not atomic. Figure 2 shows the code.
MySQL-169        Table update and log write in mysql_delete() are
                 not atomic.
MySQL-644        Calls to prepare() and optimize() in
                 mysql_select() are not atomic.
Apache-21287     Reference count decrement and checking are not
                 atomic.
Apache-25520     Threads write to same log buffer concurrently,
                 resulting in corrupted logs or crashes.
PBZip2           Variable fifo is used in one thread after being
                 freed by another. Figure 4 shows the code.
SPLASH2-fft      Variable finishtime and initdonetime are read
                 before assigned the correct values.
SPLASH2-lu       Variable rf is read before assigned the correct
                 value.
SPLASH2-barnes   Variable tracktime is read before assigned the
                 correct value.


Table 3: All races used in evaluation. We identify
races in MySQL and Apache as “(application name)-(Bugzilla #)”,
the only race in PBZip2 as “PBZip2”, and races
in SPLASH2 as “SPLASH2-(benchmark name)”.


high runtime overhead and hinders compiler optimization.
LOOM solves this problem using a basic block
cloning idea [29]. LOOM keeps two versions of each
basic block in the application binary, an originally
compiled version that is optimized, and a hot backup that is
unoptimized and padded for live update. To update a
basic block at runtime, LOOM simply updates the backup
and switches the execution to the backup by flipping a
switch flag.

LOOM instruments only function entries and loop
backedges to check the switch flags because doing so for
each basic block is expensive. Similar to the wait flags in
(§4), the switch flags are almost always 0, so that hardware
cache and branch prediction can effectively hide
the overhead of checking them. This technique makes
live-update-ready applications run as fast as the original
application during normal operations (§6). Figure 8c
shows the final results after all LOOM transformations.

Note that the accesses to switch flags are correctly
protected by the update lock. An application checks the
switch flag when holding the update lock in read mode,
and the update engine sets the switch flag when holding
the update lock in write mode.


6 Evaluation


We implemented LOOM in Linux. It consists of 4,852
lines of C++ code, with 1,888 lines for the LLVM compiler
plugin, 2,349 lines for the live-update engine, and
615 lines for the controller.

We evaluated LOOM on nine real races from a diverse
set of applications, ranging from two server applications
MySQL [5] and Apache [11], to one desktop application
PBZip2 [6], to three scientific applications fft, lu, and
barnes in SPLASH2 [7]. Table 3 lists all nine races.
Our race selection criteria are simple: (1) they are
extensively used in previous studies [31, 42, 43] and (2) the
application can be compiled by LLVM and the race can
be reproduced on our main evaluation machine, a 2.66
GHz Intel quad-core machine with 4 GB memory
running 32-bit Linux 2.6.24.

We used the following workloads in our experi- 
ments. For MySQL, we used SysBench [8] (advanced 
transaction workload), which randomly selects, updates, 
deletes, and inserts database records. For Apache, we 
used ApacheBench [1], which repeatedly downloads a 
webpage. Both benchmarks are multithreaded and used 
by the server developers. We made both SysBench and 
ApacheBench CPU bound by fitting the database or web 
contents within memory; we also ran both the client 
and the server on the same machine, to avoid mask- 
ing LOOM’s overhead with the network overhead. Un- 
less otherwise specified, we ran 16 worker threads for 
MySQL and Apache because they performed best with 
8-16 threads. We ran four worker threads for PBZip2 and 
SPLASH2 applications because they are CPU-intensive 
and our evaluation machine has four cores. 

We measured throughput (TPUT) and response time
(RESP) for server applications and overall execution
time for other applications. We report LOOM’s relative
overhead, the smaller the better. We compiled the
applications down to x86 instructions using llvm-gcc -O2
and LLVM’s bitcode compiler llc. For all the performance
numbers reported, we repeated the experiment 50
times and took the average.
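
As a worked illustration of the reported metric, the sketch below computes relative overhead from repeated runs. The formula is an assumption made for this illustration (the text only states that smaller is better), and the sample numbers are made up.

```python
from statistics import mean

def relative_overhead(baseline_runs, loom_runs, higher_is_better=False):
    """Relative overhead of the instrumented runs over the baseline runs.

    For time-like metrics (response time, execution time) a larger value is
    worse; for throughput a smaller value is worse, so the sign is flipped.
    """
    base, inst = mean(baseline_runs), mean(loom_runs)
    if higher_is_better:            # throughput
        return (base - inst) / base
    return (inst - base) / base     # response or execution time

# Made-up numbers, for illustration only.
resp = relative_overhead([10.0, 10.0], [10.4, 10.6])
tput = relative_overhead([200.0, 200.0], [190.0, 194.0], higher_is_better=True)
```

With this convention a negative value means the instrumented version ran faster than the original.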

We focus our evaluation on five dimensions: 

1. Overhead. Does LOOM incur low overhead? 

2. Scalability. Does LOOM scale well as the number of 
application threads increases? 

3. Reliability. Can LOOM be used to fix the races listed 
in Table 3? What are the performance and reliability 
tradeoffs of execution filters? 

4. Availability. Does LOOM severely degrade applica- 
tion availability when execution filters are installed? 

5. Timeliness. Can LOOM install fixes in a timely way? 


6.1 Overhead 


Figure 13 shows the performance overhead of LOOM
during the normal operations of the applications. We
also show the overhead of bare Pin for reference. LOOM
incurs little overhead for Apache and SPLASH2 benchmarks.
It increases MySQL’s response time by 4.11%
and degrades its throughput by 3.76%. In contrast, Pin
incurs higher overhead for all applications evaluated,
especially for Apache and MySQL.


³We include applications that do not need live update for two
reasons. First, as discussed in §1, LOOM can provide quick workarounds
for these applications as well. Second, we use them to measure LOOM’s
overhead and scalability.


Figure 13: LOOM’s relative overhead during normal operation.
Smaller numbers are better. We show Pin’s overhead for
reference. Some Pin bars are broken.


Figure 14: Effects of LOOM’s optimizations. Label unopt represents the versions with no optimizations; cloning represents the
version with basic block cloning (§5); wait-flag represents the version with statement “if (wait[stmt_id])” added (§4.2); and
inlining indicates the version with all LOOM instrumentation inlined into the applications.

We also evaluated how much our optimizations reduce
LOOM’s overhead. Figure 14 shows the effects of these
optimizations. Both cloning and wait-flag are very
effective at reducing overhead. Cloning reduces LOOM’s
response-time overhead on Apache from 100% to 17%.
It also reduces LOOM’s overhead on fft from 15 times to
8 times. Wait-flag actually makes Apache run faster than
the original version. Inlining does not help the servers
much, but it does help the SPLASH2 applications.


6.2 Scalability 


LOOM synchronizes with application threads via a
read-write lock. Thus, one concern is whether LOOM scales well
as the number of application threads increases. To evaluate
LOOM’s scalability, we ran Apache and MySQL with
LOOM on a 48-core machine with four 1.9 GHz 12-core
AMD CPUs and 64 GB memory running 64-bit Linux
2.6.24. In each experiment, we pinned the benchmark to
one CPU and the server to the other three to avoid
unnecessary CPU contention between them.

Figure 15 shows LOOM’s relative overhead vs. the 
number of application threads for Apache and MySQL. 





                       Mutual                     Unilateral
Race ID        Events  TPUT     RESP      Events  TPUT     RESP
MySQL-169      2       0.14%    0.15%     1       3.28%    3.37%
MySQL-644      4       0.22%    0.20%     ?       32.58%   48.34%
MySQL-791      4       0.23%    0.32%     ?       0.33%    0.48%
Apache-21287   16      -0.02%   -0.03%    ?       54.03%   118.16%
Apache-25520   1       0.52%    0.55%     ?       86.04%   637.03%


Table 4: Execution filter stats for atomicity errors.
Column Events counts the number of events in each filter.





Race ID          Events  Overhead
PBZip2           6       1.26%
SPLASH2-fft      6       0.08%
SPLASH2-lu       2       1.68%
SPLASH2-barnes   2       1.99%


Table 5: Execution filter stats for order errors.


LOOM scales well with the number of threads. Its relative
overhead varies only slightly. Even with 32 server
threads, the overhead for Apache is less than 3%, and the
overhead for MySQL is less than 12%.

Our initial MySQL overhead was around 16%.
We analyzed the execution counts of the LOOM-inserted
functions and immediately identified two
update-check sites (cycle_check() calls) that executed
exceedingly many times. These update-check
sites are in MySQL functions ptr_compare_1 and
Field_varstring::val_str. The first function
compares two strings, and the second copies one string to
another. Each function has a loop with a few statements
and no function calls. Such tight loops cause higher overhead
for LOOM, but rarely need to be updated. We thus
disabled the update-check sites in these two functions,
which reduced the overhead of MySQL down to 12%.
This optimization can be easily automated using static or
dynamic analysis, which we leave for future work.
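
One way such automation might look is sketched below. This is only an illustration of a possible dynamic analysis, not something LOOM implements: the profile format, the hot_fraction threshold, and all execution counts are hypothetical.

```python
def sites_to_disable(exec_counts, hot_fraction=0.25):
    """Return update-check sites whose share of all update-check
    executions exceeds hot_fraction (a hypothetical tuning knob)."""
    total = sum(exec_counts.values())
    if total == 0:
        return []
    return sorted(site for site, n in exec_counts.items()
                  if n / total > hot_fraction)

# Made-up execution counts of cycle_check() per enclosing function.
profile = {
    "ptr_compare_1": 4_000_000,
    "Field_varstring::val_str": 3_500_000,
    "mysql_select": 20_000,
    "mysql_delete": 5_000,
}
disabled = sites_to_disable(profile)
```

A real tool would also have to confirm that a candidate site sits in a short loop without function calls, since disabling a site removes a point at which a thread can pause for an update.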


6.3 Reliability 


LOOM can be used to fix all races evaluated. (We verified
this result by manually inspecting the application binary.)
Table 4 shows the statistics for the execution filters that
fix atomicity errors. Table 5 shows the statistics for the
execution filters that fix order errors.


Figure 15: LOOM’s relative overhead vs. the number of application threads.

In all cases, we can fix the race using multiple execution
filters, demonstrating the flexibility of LOOM. (The
filters for MySQL-791 are shown in Figure 3.) We only
show the statistics of one execution filter of each con- 
straint type; other filters of the same type are similar. Our 
results show that the filters are fairly small, 3.79 events 
on average and no more than 16 events, demonstrating 
the ease of use of LOOM. Most filters incur only a small 
overhead on top of LOOM. Unilateral filters tend to be 
slightly smaller than mutual exclusion filters, but they 
can be expensive sometimes. They incur little overhead 
for two of the MySQL bugs because the code regions 
protected by the filters rarely run. 

These different reliability and performance overheads 
present an interesting tradeoff to developers. For ex- 
ample, users can choose to install a unilateral filter for 
immediate protection, then atomically replace it with 
a faster mutual exclusion filter. Moreover, a user can 
choose an “expensive” filter as long as their workload 
is compatible with the filter. 


6.4 Availability 


We show that LOOM can improve server availability 
by comparing LOOM to the restart-based software up- 
date approach. We restarted a server by running its 
startup script under /etc/init.d. We chose two 
races, MySQL-791 and Apache-25520, and measured 
how software updates (conventional or with LOOM) 
might degrade performance. Note this comparison fa- 
vors conventional updates because we only compare the 
installation of the fix, but LOOM also makes it quick to 
develop fixes. Figure 16 shows the comparison result. 
Using the restart approach, Apache is unavailable for 4 
seconds, and MySQL is unavailable for 2 seconds. More- 
over, the restarts also cause Apache and MySQL to lose 
their internal cache, leading to a ramp-up period after the 
restart. In contrast, installing a filter using LOOM (at
second 5) does not degrade throughput for MySQL and
only degrades throughput slightly for Apache.


6.5 Timeliness 


The sooner LOOM installs a filter, the sooner the
application is protected from the corresponding race.
This timeliness is critical for server applications because
malicious clients may exploit a known race and launch
attacks. In this subsection, we compare how quickly
LOOM’s evacuation algorithm installs an aggressive filter
vs. an approach that passively waits for function
quiescence. We chose Apache-25520 as the benchmark race.
We wrote a simple mutual exclusion filter that fixes the
race by making function ap_buffered_log_writer
a critical region. We then measured the latency from
the moment LOOM receives a filter to the moment
the filter is installed. We simulated a function quiescence
approach by running LOOM without making any
wait_flag false, so that a thread can pause wherever
we insert update-checks. We used the same SysBench
and ApacheBench workload. Our results show
that LOOM can install the filter within 368 ms. It spends
the majority of the time waiting for threads to evacuate. In
contrast, an approach based on function quiescence fails
to install the filter in an hour, our experiment’s time limit.


7 Related Work 


Live update. LOOM differs from previous live update
systems [10, 12, 15, 35, 38, 39, 51] in that it is explic- 
itly designed for developers to quickly develop tempo- 
rary workarounds to races. Moreover, it can automati- 
cally ensure the safety of the workarounds. In contrast, 
previous work focuses only on live update after a source 
patch is available, thus it does not address the automatic- 
safety and flexibility problems LOOM addresses. 

The live update system closest to LOOM is
STUMP [38], which can live-update multithreaded
applications written in C. Its prior version Ginseng [39] works
with single-threaded C applications. Both STUMP and
Ginseng have been shown to be able to apply arbitrary
source patches and update applications across major
releases. Unlike LOOM, both STUMP and Ginseng require
source modifications and rely on extensive user annotations
for safety because the safety of arbitrary live updates
has been proven undecidable [22].


Figure 16: Throughput degradation for fixing races with LOOM vs. with conventional software update.

A number of live update systems can update kernels
without reboots [12, 15, 35]. The most recent one,
Ksplice [12], constructs live updates from object code,
and does not require developer effort to adapt existing
source patches. Unlike LOOM, Ksplice uses function
quiescence for safety, and is thus prone to the unsafe
state problem discussed in §4. Another kernel live update
system, DynAMOS [35], requires users to manually
construct multiple versions of a function to update
non-quiescent functions. This technique is different from
basic block cloning (§5): the former is manual and for
safety, whereas the latter is automatic and for speed.

Error workaround and recovery. We compare
LOOM to recent error workaround and recovery tools.
ClearView [44], ASSURE [50], and Failure-oblivious
computing can increase application availability by
letting applications continue despite errors. Compared to
LOOM, these systems are unsafe, and do not directly
deal with races. Rx [46] can safely recover from runtime
faults using application checkpoints and environment
modifications, but it does not fix errors because the
same error can re-appear. Vigilante [17] enables hosts
to collaboratively contain worms using self-verifiable
alerts. By automatically ensuring filter safety, LOOM
shares similar benefits.

Two recent systems, Dimmunix [26] and Gadara [52], 
can fix deadlocks in legacy multithreaded programs. 
Dimmunix extracts signatures from occurred deadlocks 
(or starvations) and dynamically avoids them in future 
executions. Gadara uses control theory to statically 
transform a program into a deadlock-free program. Both 
systems have been shown to work on real, large applica- 
tions. They may possibly be adapted to fix races, albeit at 
a coarser granularity because these systems control only 
lock operations. 

Kivati [16] automatically detects and prevents atomicity
violations for production systems. It reduces performance
overhead by cleverly using hardware watch
points, but the limited number of watch points on commodity
hardware means that Kivati cannot prevent all
atomicity violations. Nor does Kivati prevent execution
order violations. LOOM can be used to work around these
errors missed by Kivati.

Program instrumentation frameworks. Previous work
[3, 19, 40] can instrument programs with low runtime
overhead, but instrumentation has to be done at compile 
time. Translation-based dynamic instrumentation frame- 
works [14, 20, 34] can update programs at runtime but 
incur high overhead. In particular, vx32 [20] is a novel 
user-level sandbox that reduces overhead using segmen- 
tation hardware; it can be used as an efficient dynamic 
binary translator. Jump-based instrumentation frame- 
works [24, 48] have low overhead but automatically en- 
suring safety for them can be difficult due to low-level is- 
sues such as position-dependent code, short instructions, 
and locations of basic blocks. 

One advantage of these instrumentation frameworks 
over LOOM is that LOOM requires CFGs and symbol in- 
formation to be distributed to user machines, thus it risks 
leaking proprietary code information. However, this risk 
is not a concern for open-source software. Moreover, 
LOOM only mildly increases this risk because CFGs
can often be reconstructed from binaries, and companies 
such as Microsoft already share symbol information [4]. 

The advantage of LOOM is that it combines static 
and dynamic instrumentation, thus allowing arbitrary dy- 
namic updates issued by execution filters with negligible 
runtime overhead. LOOM borrows basic block cloning 
from previous work by Liblit et al. [29], but their frame- 
work is static only. This idea has also been used in other 
systems (e.g., LIFT [45]). 

Other related work. Our work was inspired by many
observations made by Lu et al. [33]. Aspect-oriented
programming (AOP) allows developers to “weave”
synchronizations into code [27, 30]. LOOM’s execution filter
language shares some similarity with AOP, and can be made
more expressive by incorporating more aspects. However,
to the best of our knowledge, no existing AOP systems
were designed to support race fixing at runtime. We
view the large body of race detection and diagnosis work
(e.g., [31, 32, 37, 42, 47, 49, 54]) as complementary to
our work, and LOOM can be used to fix errors detected
and isolated by these tools.


8 Conclusion 


We have presented LOOM, a live-workaround system de- 
signed to quickly and safely fix application races at run- 
time. Its flexible language allows developers to write 
concise execution filters to declare their synchronization 
intents on code. Its evacuation algorithm automatically 
ensures the safety of execution filters and their installa- 
tion/removal processes. It uses hybrid instrumentation to 
reduce its performance overhead during the normal oper- 
ations of applications. We have evaluated LOOM on nine 
real races from a diverse set of applications. Our results 
show that LOOM is fast, scalable, and easy to use. It can 
safely fix all evaluated races in a timely manner, thereby 
increasing application availability. 

LOOM demonstrates that live-workaround systems can
increase application availability with little performance 
overhead. In our future work, we plan to extend this idea 
to other classes of errors (e.g., security vulnerabilities).


Abstract


Data races are an important class of concurrency errors where two threads erroneously access a shared memory loca- 
tion without appropriate synchronization. This paper presents DataCollider, a lightweight and effective technique 
for dynamically detecting data races in kernel modules. Unlike existing data-race detection techniques, DataCollider 
is oblivious to the synchronization protocols (such as locking disciplines) the program uses to protect shared 
memory accesses. This is particularly important for low-level kernel code that uses a myriad of complex architec- 
ture/device specific synchronization mechanisms. To reduce the runtime overhead, DataCollider randomly samples 
a small percentage of memory accesses as candidates for data-race detection. The key novelty of DataCollider is that
it uses breakpoint facilities already supported by many hardware architectures to achieve negligible runtime overheads.
We have implemented DataCollider for the Windows 7 kernel and have found 25 confirmed erroneous data
races of which 12 have already been fixed.


1. Introduction 


Concurrent systems are hard to design, arguably be- 
cause of the difficulties of finding and fixing concur- 
rency errors. Data races are an important class of con- 
currency errors, where the program fails to use proper 
synchronization when accessing shared data. The ef- 
fects of an erroneous data race can range from immedi- 
ate program crashes to silent lost updates and data cor- 
ruptions that are hard to reproduce and debug. 


Two memory accesses in a program are said to conflict 
if they access the same memory location and at least 
one of them is a write. A program contains a data race 
if two conflicting accesses can occur concurrently. Figure 1
shows a variation of a data race we found in the
Windows kernel. The threads appear to be accessing 
different fields. However, these bit-fields are mapped to 
the same word by the compiler and the concurrent ac- 
cesses result in a data race. In this case, an update to the 
statistics field possibly hides an update to the status 
field. 
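
The lost update can be reproduced deterministically by spelling out the read-modify-write sequences that a compiler typically emits for bit-field stores. The sketch below simulates one bad interleaving by hand in Python. The 4-bit and 28-bit widths follow Figure 1, while the placement of status in the low bits, the helper names, and the chosen interleaving are assumptions for illustration.

```python
STATUS_MASK = 0xF            # assumed layout: low 4 bits hold status
PKT_SHIFT = 4                # remaining 28 bits hold pktRcvd

def with_status(word, status):
    """Return word with the status bit-field replaced."""
    return (word & ~STATUS_MASK & 0xFFFFFFFF) | (status & STATUS_MASK)

def with_pkt_incremented(word):
    """Return word with the pktRcvd bit-field incremented by one."""
    pkt = ((word >> PKT_SHIFT) + 1) & 0x0FFFFFFF
    return (pkt << PKT_SHIFT) | (word & STATUS_MASK)

shared = 0                   # the 32-bit word holding both fields

# One possible interleaving of the two unsynchronized threads:
t1_copy = shared                         # Thread 1 loads the word
t2_copy = shared                         # Thread 2 loads the same word
shared = with_status(t1_copy, 1)         # Thread 1 stores status = 1
shared = with_pkt_incremented(t2_copy)   # Thread 2 stores using its stale copy

status = shared & STATUS_MASK
pkt_rcvd = shared >> PKT_SHIFT
```

In this interleaving the packet counter is updated but status is still 0: Thread 2 wrote back the whole word from its stale copy and erased Thread 1's store.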


This paper presents DataCollider, a tool for dynamically
detecting data races in kernel modules. DataCollider
is lightweight. It samples a small number of memory
accesses for data-race detection and uses code-breakpoint
and data-breakpoint¹ facilities available in
modern hardware architectures to efficiently perform
this sampling. As a result, DataCollider has no runtime
overhead for non-sampled memory accesses, allowing
the tool to run with negligible overheads for low
sampling rates.


We have implemented DataCollider for the 32-bit Win- 
dows kernel running on the x86 architecture, and used it 
to detect data races in the core kernel and several mod- 
ules such as the filesystem, the networking stack, the 
storage drivers, and a network file system. We have 
found a total of 25 erroneous data races of which 12 
have already been fixed at the time of writing. In our 
experiments, the tool is able to find erroneous data rac- 
es for sampling rates that incur runtime overheads of 
less than 5%. 


Researchers have proposed a multitude of dynamic
data-race detectors [1,2,3,4,5,6,7] for user-mode programs.
In essence, these tools work by dynamically monitoring
the memory accesses and synchronizations performed
during a concurrent execution. As data races manifest
rarely at runtime, these tools attempt to infer conflicting
accesses that could have executed concurrently. The
tools differ in how they perform this inference, either
using the happens-before [8] ordering induced by the
synchronization operations [4,5,6] or a lock-set based
reasoning [1] or a combination of the two [2,3,7].


¹ Data breakpoints are also called hardware watchpoints.


struct {
    int status:4;
    int pktRcvd:28;
} st;

Thread 1              Thread 2
st.status = 1;        st.pktRcvd++;


Figure 1: An example of data race. Even though the
threads appear to be modifying different variables in
the source code, the variables are bit fields mapping
to the same integer.


There are several challenges in engineering a data-race 
detection tool for the kernel based on previous ap- 
proaches. First, the kernel-mode code operates at a low- 
er concurrency abstraction than user-mode code, which 
can rely on clean abstractions of threads and synchroni- 
zations provided by the kernel. In the kernel, the same 
thread context can execute code from a user-mode pro- 
cess, a device interrupt service routine, or a deferred 
procedure call (DPC). In addition, it is an onerous task 
to understand the semantics of complex synchronization 
primitives in order to infer the happens-before relation 
or lock-sets. For instance, Windows supports more than 
a dozen locks with different semantics on how the lock 
holder synchronizes with hardware interrupts, the 
scheduler, and the DPCs. It is also common for kernel
modules to roll out custom implementations of
synchronization primitives.


Second, hardware-facing kernel modules need to syn- 
chronize with hardware devices that concurrently modi- 
fy device state and memory. It is important to design a 
data-race detection tool that can find these otherwise 
hard-to-find data races between the hardware and the 
kernel. 


Finally, existing dynamic data-race detectors add pro- 
hibitive run-time overheads. It is not uncommon for 
such tools to incur up to 200x slowdowns [9]. The 
overhead is primarily due to the need to monitor and 
process all memory and synchronization operations at 
run time. Significant engineering effort in building
data-race detectors goes into reducing the runtime overhead
and the associated memory and log management [9,3].
Replicating these efforts within the constraints of kernel 
programming is an arduous, if not impossible, task.


AtPeriodicIntervals() {
    // determine k based on desired
    // memory access sampling rate
    repeat k times {
        pc = RandomlyChosenMemoryAccess();
        SetCodeBreakpoint( pc );
    }
}


OnCodeBreakpoint( pc ) {
    // disassemble the instruction at pc
    (loc, size, isWrite) = disasm( pc );
    DetectConflicts( loc, size, isWrite );
    // set another code break point
    pc = RandomlyChosenMemoryAccess();
    SetCodeBreakpoint( pc );
}


DetectConflicts( loc, size, isWrite ) {
    temp = read( loc, size );
    if ( isWrite )
        SetDataBreakpointRW( loc, size );
    else
        SetDataBreakpointW( loc, size );
    delay();
    ClearDataBreakpoint( loc, size );
    temp’ = read( loc, size );
    if ( temp != temp’ ||
         data breakpoint fired )
        ReportDataRace( );
}


Figure 2: The basics of the DataCollider algorithm.
Right before a read or write access to a shared
memory location, chosen at random, DataCollider
monitors for any concurrent accesses that conflict
with the current access.


Moreover, these tools rely on invasive instrumentation 
techniques that are difficult to get right on low-level 
kernel code. 


DataCollider uses a different approach to overcome 
these challenges. The crux of the algorithm is shown in 
Figure 2. DataCollider samples a small number of 
memory accesses at runtime by inserting code break- 
points at randomly chosen memory access instructions. 
When a code breakpoint fires, DataCollider detects data 
races involving the sampled memory access for a small 
time window. It simultaneously employs two strategies
to do so. First, DataCollider sets a data breakpoint to
trap conflicting accesses by other threads. Second, to detect
conflicting writes performed by hardware devices and
by processors accessing the memory location through a
different virtual address, DataCollider uses a repeated-read
strategy. It reads the value once before and once
after the delay. A change in value is an indication of a
conflicting write, and hence a data race.
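
The interplay of the two strategies can be illustrated with a toy model. The sketch below is written in Python and is not DataCollider's implementation: the Memory class, its cpu_read, cpu_write and device_write methods, and the delay callback that stands in for concurrent activity are all hypothetical.

```python
class Memory:
    """Toy memory with a single data breakpoint (illustrative only)."""
    def __init__(self):
        self.cells = {}
        self.watch = None          # (loc, trap_reads) while a breakpoint is armed
        self.fired = False

    def cpu_read(self, loc):
        if self.watch == (loc, True):
            self.fired = True      # read trapped by a read/write breakpoint
        return self.cells.get(loc, 0)

    def cpu_write(self, loc, value):
        if self.watch is not None and self.watch[0] == loc:
            self.fired = True      # writes trap under either breakpoint kind
        self.cells[loc] = value

    def device_write(self, loc, value):
        self.cells[loc] = value    # e.g. DMA: bypasses CPU breakpoints


def detect_conflicts(mem, loc, is_write, delay):
    """Return True if a conflicting access is seen during the delay window."""
    before = mem.cells.get(loc, 0)
    # Trap reads only if the sampled access is itself a write.
    mem.watch, mem.fired = (loc, is_write), False
    delay()                        # other agents run here
    mem.watch = None
    after = mem.cells.get(loc, 0)
    return mem.fired or before != after


mem = Memory()
race_by_cpu = detect_conflicts(mem, 0x10, False, lambda: mem.cpu_write(0x10, 1))
race_by_device = detect_conflicts(mem, 0x20, False, lambda: mem.device_write(0x20, 7))
read_read = detect_conflicts(mem, 0x30, False, lambda: mem.cpu_read(0x30))
write_read = detect_conflicts(mem, 0x40, True, lambda: mem.cpu_read(0x40))
```

In this model the device write is caught only by the repeated read, while a concurrent read against a sampled write is caught only by the breakpoint. The repeated read alone would also miss a write that stores the value already present.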


The DataCollider algorithm has two features that make 
it suitable for kernel data-race detection. First and 
foremost, it is easy to implement. Barring some imple- 
mentation details (Section 3), the entire algorithm is 
shown in Figure 2. In addition, it is entirely oblivious to 
the synchronization protocols used by the kernel and 
the hardware, a welcome design point as DataCollider 
does not have to understand the complex semantics of 
kernel synchronization primitives. 


When DataCollider finds a data race through the
data-breakpoint strategy, it catches both threads “red- 
handed,” as they are about to execute conflicting ac- 
cesses. This greatly simplifies the debugging of data 
race reports from DataCollider as the tool can collect 
useful debugging information, such as the stack trace of 
the racing threads along with their context information, 
without incurring this overhead on non-sampled or non- 
racy accesses. 


Not all data races are erroneous. Such benign races in- 
clude races that do not affect the program outcome, 
such as updates to logging/debugging variables, and 
races that affect the program outcome in a manner ac- 
ceptable to the programmer, such as conflicting updates 
to a low-fidelity counter. DataCollider uses a post- 
processing phase that prunes and prioritizes the data- 
race reports before showing them to the user. In our 
experience with DataCollider, we have observed that
only around 10% of data-race reports correspond
to real errors, making the post-processing step
absolutely crucial for the usability of the tool.


2. Background and Motivation 


Shared memory multiprocessors are specifically built to 
allow concurrent access to shared data. So why do data 
races represent a problem at all? 


The key motivation for data race detection is the empirical
fact that programmers most often use synchronization
to restrict accesses to shared memory. Data races
can thus be an indication of incorrect or insufficient
synchronization in the program. In addition, data races
can also reveal programming mistakes not directly related
to concurrency, such as buffer overruns or use-after-free,
which indirectly result in inadvertent sharing
of memory.


Another important reason for avoiding data races is to 
protect the program from the weak memory models of 
the compiler and the hardware. Both the compiler and 
hardware can reorder instructions and change the be- 
havior of racy programs in complex and confusing 
ways [10,11]. Even if a racy program works correctly 
for the current compiler and hardware configuration, it 
might fail on future configurations that implement more 
aggressive memory-model relaxations. 


While bugs caused by data races may of course be 
found using more conventional testing approaches such 
as stress testing, the latter often fails to provide actiona- 
ble information to the programmer. Clearly, a data race 
report including stack traces or data values (or even 
better, a core dump that demonstrates the
actual data race) is easier to understand and fix than a 
silent data corruption that leads to an obscure failure at 
some later point during program execution. 


2.1. Definition of Data Race 


There is no “gold standard” for defining data races; 
several researchers have used the term to mean different 
things. For our definition, we consulted two respected 
standards (POSIX threads [12] and the drafts of the C++
and C memory model standards [11,10]) and general- 
ized their definitions to account for the particularities of 
kernel code. Our definition of data race is: 


• Two operations that access main memory are
  called conflicting if
    o the physical memory they access is not
      disjoint,
    o at least one of them is a write, and
    o they are not both synchronization accesses.

• A program has a data race if it can be executed on
  a multiprocessor in such a way that two conflicting
  memory accesses are performed simultaneously
  (by processors or any other device).


This definition is a simplification of [11,10] insofar as we
replaced the tricky notion of “not ordered before” with 
the unambiguous “performed simultaneously” (which 
refers to real time). 
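
As an illustration, the conflict test in this definition can be written as a small predicate. The sketch below is not taken from any tool; the `mem_access` record, its field names, and the function names are hypothetical, and it assumes that the physical address and the synchronization status of each access are already known.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of one memory access; field names are illustrative. */
typedef struct {
    uint64_t phys_addr;  /* first physical byte touched */
    uint32_t size;       /* number of bytes touched, must be > 0 */
    bool     is_write;
    bool     is_sync;    /* interlocked, volatile, or atomic access */
} mem_access;

/* True if the two byte ranges share at least one physical byte. */
bool ranges_overlap(const mem_access *a, const mem_access *b)
{
    return a->phys_addr < b->phys_addr + b->size &&
           b->phys_addr < a->phys_addr + a->size;
}

/* Conflict test from the definition: overlapping physical memory,
 * at least one write, and not both synchronization accesses. */
bool accesses_conflict(const mem_access *a, const mem_access *b)
{
    assert(a->size > 0 && b->size > 0);
    return ranges_overlap(a, b) &&
           (a->is_write || b->is_write) &&
           !(a->is_sync && b->is_sync);
}
```

Note that a synchronization access and a plain data access to the same location still conflict under this predicate, which matches the third condition of the definition.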


An important part of our definition is the distinction 
between synchronization and data accesses. Clearly, 
some memory accesses participate in perfectly desirable 
races: for example, a mutex implementation may perform
a “release” by storing the value 0 in a shared location,
while another thread is performing an acquire and
reads the same memory location. However, this is not a 
data race because we categorize both of these accesses 
as synchronization accesses. Synchronization accesses 
either involve hardware synchronization primitives 
such as interlocked instructions or use volatile or atom- 
ic annotations supported by the compiler. 


Note that our definition is general enough to apply to 
code running in the kernel, which poses some unique 
problems not found in user-mode code. For example, in 
some cases data races can be avoided by turning off 
interrupts; also, processes can exhibit a data race when 
accessing different virtual addresses that map to the 
same physical address. We talk more about these topics 
in Section 2.3.4. 


2.2. Precision of Detection 


Clearly, we would like data race detection tools to re- 
port as many data races as possible without inundating 
the user with false error reports. We use the following 
terminology to discuss the precision and completeness 
of data race detectors. A missed race is a data race that 
the tool does not warn about. A benign data race is a 
data race that does not adversely affect the behavior of 
the program. Common examples of benign data races 
include threads racing on updates to logging or statistics 
variables and threads concurrently updating a shared 
counter where the occasional incorrect update of the 
counter does not affect the outcome of the program. On 
the other hand, a false data race is an error report that 
does not correspond to a data race in the program. Stat- 
ic data-race detection techniques commonly produce 
false data races due to their inherent inability to precise- 
ly reason about program paths, aliased heap objects, 
and function pointers. Dynamic data-race detectors can 
report false data races if they do not identify or do not 
understand the semantics of all the synchronizations
used by the program. 


2.3. Related Work 


Researchers have proposed and built a plethora of race 
detection tools. We now discuss the major approaches 
and implementation techniques appearing in related 
work. We describe both happens-before-based and 
lock-set-based tracking in some detail (Sections 2.3.2 
and 2.3.3), before explaining why neither one is very 
practical for data race detection in the kernel (Section 
2.3.4). 


2.3.1. Static vs. Dynamic


Data race detection can be broadly categorized into 
static race detection [13,14,15,16,17], which typically 
analyzes source or byte code without directly executing 
the program, and dynamic race detection [1,2,3,4,5,6,7], 
which instruments the program and monitors its execu- 
tion online or offline. 


Static race detectors have been successfully applied to 
large code bases [13,14]. However, as they rely on ap- 
proximate information, such as pointer aliasing, they 
are prone to excessive false warnings. Some tools, es- 
pecially those targeting large code bases, approach this 
issue by filtering the reported warnings using heuristics 
[13]. Such heuristics can successfully reduce the false 
warnings to a tolerable level, but may unfortunately 
also eliminate correct warnings and lead to missed rac- 
es. Other tools, targeted towards highly motivated users 
that wish to interactively prove absence of data races, 
report all potential races to the user and rely on user- 
supplied annotations that indicate synchronization dis- 
ciplines [16,17]. 


Dynamic data race detectors are less prone to false 
warnings than static techniques because they monitor 
an actual execution of the program. However, they may 
miss races because successful detection might require 
an error-inducing input and/or an appropriate thread 
schedule. Also, many dynamic detectors employ several 
heuristics and approximations that can lead to false 
alarms. 


Dynamic data race detectors can be classified into cate- 
gories based on whether they model a happens-before 
relation [6,5,7] (see Section 2.3.2), lock sets [1] (see 
Section 2.3.3), or both [2,18]. 


2.3.2. Happens-Before Tracking 


Dynamic data race detectors do not just detect data rac- 
es that actually took place (in the sense that the conflict- 
ing accesses were truly simultaneous during the execu- 
tion), but look for evidence that such a schedule would 
have been possible for a slightly different timing. 
Tracking a happens-before relation on program events 
[8] is one way to infer the existence of a racy schedule. 
This transitive relation is constructed by recording both 
the ordering of events within a thread and the ordering 
effects of synchronization operations across threads. 


Once we can properly track the happens-before relation, 
race detection is straightforward: For any two conflict- 
ing accesses A and B, we simply check whether A hap- 
pens-before B, or B happens-before A, or neither. If 
neither, we know there exists a schedule where A and B
are simultaneous. If properly tracked, happens-before 
does not lead to any false alarms. However, precise 
tracking can be difficult to achieve in practice, as dis- 
cussed in Section 2.3.4. 
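
One common way to represent a happens-before relation is with vector clocks. The sketch below uses that representation purely as an illustration; it is an assumption of this example rather than something prescribed by the text, and the fixed `NUM_THREADS` and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_THREADS 2  /* illustrative; real detectors size this dynamically */

/* Vector clock: clock[t] counts the events of thread t known to have occurred. */
typedef struct { unsigned clock[NUM_THREADS]; } vclock;

/* a happens-before b iff a <= b component-wise and a != b. */
bool happens_before(const vclock *a, const vclock *b)
{
    bool strictly_less = false;
    for (int t = 0; t < NUM_THREADS; t++) {
        if (a->clock[t] > b->clock[t]) return false;
        if (a->clock[t] < b->clock[t]) strictly_less = true;
    }
    return strictly_less;
}

/* Two conflicting accesses may race if neither is ordered before the other. */
bool may_race(const vclock *a, const vclock *b)
{
    return !happens_before(a, b) && !happens_before(b, a);
}
```

The check itself is cheap; the difficulty lies in updating the clocks correctly at every synchronization operation.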


2.3.3. Lock Sets 


When detecting races in programs that follow a strict 
and consistent locking discipline, using a lock-set ap- 
proach can provide some benefits. The basic idea is to 
examine the lock set of each data access (that is, the set 
of locks held during the access) and then to take for 
each memory location the intersection of the lock sets 
of all accesses to it. If that intersection is empty, the 
variable is not consistently protected by any one lock 
and a warning is issued. 
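
The intersection step can be sketched with a bitmask per memory location. This is a minimal illustration that assumes at most 32 locks, each identified by one bit; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t lockset;  /* bit i set means lock i is held */

typedef struct {
    lockset candidates;  /* locks held on every access seen so far */
    bool    seen;        /* false until the first access is recorded */
} location_state;

/* Records one access made while holding `held`.
 * Returns true if the intersection is now empty, i.e. a warning is due. */
bool lockset_on_access(location_state *loc, lockset held)
{
    if (!loc->seen) {
        loc->candidates = held;
        loc->seen = true;
    } else {
        loc->candidates &= held;
    }
    return loc->candidates == 0;
}
```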


The main limitation of the lock set approach is that it 
does not check for true data races but for violations of a 
specific locking discipline. Unfortunately, many appli- 
cations (and in particular kernel code) use locking dis- 
ciplines that are complex and use synchronization other 
than locks. 


Whenever a program departs from a simple locking 
scheme in any of the above ways, lock-set-based race 
detectors will be forced to either issue false warnings, 
or to use heuristics to suppress these warnings. The 
latter approach is common, especially in the form of 
state machines that track the “sharing status” of a varia- 
ble [1,3]. Such heuristics are necessarily imperfect 
compromises, however (they always fail to suppress 
some false warnings and always suppress some correct 
warnings), and it is not clear how to tune them to be 
useful for a wide range of applications. 


2.3.4. Problems with Tracking Synchroni- 
zations 


Both lock-set and happens-before tracking require a 
thorough understanding of the synchronization seman- 
tics, lest they produce false alarms or miss races. There 
are two fundamental difficulties we encountered when 
trying to apply these techniques in the kernel: 


• Abstractions that we take for granted in user mode
  (such as threads) are no longer clearly defined in
  kernel mode.

• The synchronization vocabulary of kernel code is
  much richer and may include complicated sequences
  and ordering mechanisms provided by the
  hardware.


For example, interrupts and interrupt handlers break the 
thread abstraction, as the handler code may execute in a 
thread context without being part of that thread in a 
logical sense. Similar problems arise when a thread 
calls into the kernel scheduler. The code executing in 
the scheduler is not logically part of that same thread. 


Another example illustrating the difficulty of modeling 
synchronization inside the kernel is that of DMA accesses.
Such accesses are not executing inside a thread (in fact, 
they are not even executing on a processor). Clearly, 
traditional monitoring techniques have a problem be- 
cause they cannot “instrument” the DMA access. 


A similar case holds for interrupt processing. For example,
code may first write some data and then raise an
interrupt, and then the same data is read by an interrupt 
handler. Lock sets would report a false alarm because 
the data is not locked. But even happens-before tech- 
niques are problematic, because they would need to 
precisely track the causality between the instruction that 
set the interrupt and the interrupt handler. 


For these reasons, we decided to employ a design that 
entirely avoids modeling the happens-before ordering 
or lock-sets. As our results show, somewhat surprising- 
ly, neither one is required to build an effective data race 
detector. 


2.3.5. Sampling to Reduce Overhead 


To detect races, dynamic data race detectors need to 
monitor the synchronizations and memory accesses 
performed at runtime. This is typically done by instru- 
menting the code and inserting extra monitoring code 
for each data access. As the monitoring code executes 
at every memory access, the overhead can be quite sub- 
stantial. 


One way to ameliorate this issue is to exclude some 
data accesses from processing. Prior work has identi- 
fied several promising strategies: adaptive sampling 
that backs off hot locations [5] (the idea is that for such 
locations the monitoring can be less frequent and still 
detect races), or performing the full monitoring only for a
fixed fraction of the time [4] (the idea is that the proba- 
bility of catching a race is roughly proportional to this 
fraction multiplied by the number of times the race re- 
peats). But these techniques still suffer from the cost of 
sampling, performed at every memory access. DataCol- 
lider avoids this problem by using hardware breakpoint 
mechanisms. 


3. DataCollider Implementation 


This section describes the implementation of the 
DataCollider algorithm for the Windows kernel on the 
x86 architecture. The implementation heavily uses the 
code and data breakpoint mechanisms available on x86. 
The techniques described in this paper can be extended 
to other architectures and to user-mode code, but we
have not pursued this direction here.


Figure 2 describes the basics of the DataCollider algo- 
rithm. DataCollider uses the sampling algorithm, de- 
scribed in Section 3.1, to process a small percentage of 
memory accesses for data-race detection. For each of 
the sampled memory accesses, DataCollider uses a con- 
flict detection mechanism, described in Section 3.2, to 
find data races involving the sampled access. After de- 
tecting data races, DataCollider uses several heuristics, 
described in Section 3.3, to prune benign data races. 


3.1. The Sampling Algorithm 


There are several challenges in designing a good sam- 
pling algorithm for data-race detection. First, data races 
involve two memory accesses both of which need to be 
sampled to detect the race. If memory accesses are 
sampled independently, then the probability of finding 
the data race is a product of the individual sampling 
probabilities. DataCollider avoids this multiplicative 
effect by sampling the first access and using a data 
breakpoint to trap the second access. This allows 
DataCollider to be effective at low sampling rates. 
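
To make the multiplicative effect concrete, suppose (purely as an illustrative assumption) that each of the two racing accesses is sampled independently with probability $p$, and let $q$ denote the probability that the second access occurs while the first is paused. Neither symbol appears in the original text.

```latex
P_{\text{independent}} = p \cdot p = p^{2},
\qquad
P_{\text{trap}} = p \cdot q .
```

With $p = 0.01$, independent sampling catches a given racing pair with probability $10^{-4}$, whereas trapping the second access retains a probability of $10^{-2}\,q$, which no longer shrinks quadratically as the sampling rate is lowered.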


Second, data races are rare events — most executed in- 
structions do not result in a data race. The sampling 
algorithm should weed out the small percentage of rac- 
ing accesses from the majority of non-racing accesses. 
The key intuition behind the sampling algorithm is that 
if a program location is buggy and fails to use the right 
synchronization when accessing shared data, then every 
dynamic execution of that buggy code is likely to par- 
ticipate in a data race. Accordingly, DataCollider per- 
forms static sampling of program locations rather than 
dynamic sampling of executed instructions. A static 
sampler gives equal preference to rarely executed
instructions (which are likely to have bugs hidden in 
them) and frequently executed instructions. 


3.1.1. Static Sampling Using Code Break- 
points 


The static sampling algorithm works as follows. Given 
a program binary, DataCollider disassembles the binary 
to generate a sampling set consisting of all program 
locations that access memory. The tool currently requires
the debugging symbols of the program binary to
perform this disassembly. This requirement can be re- 
laxed by using sophisticated disassemblers [19] in the 
future. 


DataCollider performs a simple static analysis to identi- 
fy instructions that are guaranteed to only touch thread- 
local stack locations and removes them from the sam- 
pling set. Similarly, DataCollider removes synchroniz- 
ing instructions from the sampling set by removing 
instructions that access memory locations tagged as
“volatile” or those that use hardware synchronization 
primitives, such as interlocked. This prevents DataCol- 
lider from reporting races on synchronization variables. 
However, DataCollider can still detect a data race be- 
tween a synchronization access and a regular data ac- 
cess, if the latter is in the sampling set. 


DataCollider samples program locations from the sam- 
pling set by inserting code breakpoints. The initial 
breakpoints are set at a small number of program loca- 
tions chosen uniformly randomly from the sampling set. 
If and when a code breakpoint fires, DataCollider per- 
forms conflict detection for the memory access at that 
breakpoint. Then, DataCollider chooses another program
location uniformly randomly from the sampling set and 
sets a breakpoint at that location. 


This algorithm uniformly samples all program locations 
in the sampling set irrespective of the frequency with 
which the program executes these locations. This is 
because the choice of inserting a code breakpoint is 
performed uniformly at random for all locations in the 
sampling set. Over a period of time, the breakpoints 
will tend to reside at rarely executed program locations, 
increasing the likelihood that those locations are sam- 
pled the next time they execute. 
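
The sampling loop described above can be simulated in a few lines. This is only a sketch under stated assumptions: the sampling set is a small fixed array, a toy linear congruential generator stands in for the random source, and the names `sampler`, `arm_random`, and `on_execute` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SET_SIZE 8   /* illustrative number of program locations */

typedef struct {
    bool     armed[SET_SIZE];  /* armed[i]: code breakpoint set at location i */
    uint32_t rng;              /* state of a toy random generator */
} sampler;

/* Toy linear congruential generator, used only to keep the sketch portable. */
uint32_t next_random(sampler *s)
{
    s->rng = s->rng * 1664525u + 1013904223u;
    return s->rng >> 16;
}

/* Arm a breakpoint at a location drawn uniformly from the sampling set. */
void arm_random(sampler *s)
{
    s->armed[next_random(s) % SET_SIZE] = true;
}

/* Called when the program executes location `loc`. Returns true if a
 * breakpoint fired there, in which case conflict detection would run. */
bool on_execute(sampler *s, size_t loc)
{
    assert(loc < SET_SIZE);
    if (!s->armed[loc]) return false;
    s->armed[loc] = false;  /* the breakpoint fired: remove it */
    arm_random(s);          /* and choose a fresh location uniformly */
    return true;
}

/* Number of breakpoints currently set. */
size_t armed_count(const sampler *s)
{
    size_t n = 0;
    for (size_t i = 0; i < SET_SIZE; i++)
        if (s->armed[i]) n++;
    return n;
}
```

Because a breakpoint stays armed until its location executes, breakpoints that land on rarely executed code remain in place longer, which is the drift toward cold locations described above.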


If DataCollider has information on which program loca- 
tions are likely to participate in a race, either through 
user annotations or through prior analysis [20], then the
tool can prioritize those locations by biasing their selec- 
tion from the sampling set. 


3.1.2. Controlling the Sampling Rate 


While the program cannot affect the sampling distribu- 
tion over program locations, the sampling rate is inti- 
mately tied to how frequently the program executes 
locations with a code breakpoint. In the worst case, if 
all of the breakpoints are set on dead code, DataCollider 
will stop performing data-race detection altogether. To 
avoid this and to better control the sampling rate, 
DataCollider periodically checks the number of break- 
points fired every second, and adjusts the number of 
breakpoints set in the program based on whether the
experienced sampling rate is higher or lower than the 
target rate. 
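
A feedback rule of this kind might look as follows. The text does not specify the adjustment step, so the additive step of one breakpoint per period, the bounds, and the function name are all assumptions of this sketch.

```c
#include <assert.h>

/* Returns the number of breakpoints to keep set for the next period,
 * given how many fired during the last second. Assumed policy: move by
 * one breakpoint toward the target, staying within [1, max_armed]. */
unsigned adjust_breakpoints(unsigned armed, unsigned fired_per_sec,
                            unsigned target_per_sec, unsigned max_armed)
{
    assert(max_armed >= 1);
    if (fired_per_sec < target_per_sec && armed < max_armed)
        return armed + 1;   /* sampling too slowly: set more breakpoints */
    if (fired_per_sec > target_per_sec && armed > 1)
        return armed - 1;   /* sampling too quickly: set fewer */
    return armed;
}
```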


3.2. Conflict-Detection 


As described in the previous section, DataCollider picks 
a small percentage of memory accesses as likely candi- 
dates for data-race detection. For these sampled accesses,
DataCollider pauses the current thread, waiting to
see if another thread makes a conflicting access to the 
same memory location. It uses two strategies: data 
breakpoints and repeated-reads. DataCollider uses these 
two strategies simultaneously as each complements the 
weaknesses of the other. 


3.2.1. Detecting Conflicts with Data Break- 
points 


Modern hardware architectures provide a facility to trap 
when a processor reads or writes a particular memory 
location. This is crucial for efficient support for data 
breakpoints in debuggers. The x86 hardware supports 
four data breakpoint registers. DataCollider uses them 
to effectively monitor possible conflicting accesses to 
the currently sampled access. 


When the current access is a write, DataCollider in- 
structs the processor to trap on a read or write to the 
memory location. If the current access is a read, 
DataCollider instructs the processor to trap only on a 
write, as concurrent reads to the same location do not 
conflict. If no conflicting accesses are detected, 
DataCollider resumes the execution of the current 
thread after clearing the data breakpoint registers. 
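
The choice of trap condition reduces to a single comparison. In the sketch below the enum and function names are hypothetical; the numeric values mirror the x86 DR7 R/W field encodings for "data writes only" (01b) and "data reads or writes" (11b).

```c
#include <assert.h>

typedef enum { ACCESS_READ, ACCESS_WRITE } access_kind;

/* Values follow the x86 DR7 R/W field: 1 = break on data writes,
 * 3 = break on data reads or writes. */
typedef enum {
    TRAP_ON_WRITE         = 1,
    TRAP_ON_READ_OR_WRITE = 3
} trap_kind;

/* A sampled write conflicts with any other access; a sampled read
 * conflicts only with writes. */
trap_kind trap_for(access_kind sampled)
{
    return sampled == ACCESS_WRITE ? TRAP_ON_READ_OR_WRITE : TRAP_ON_WRITE;
}
```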


Each processor has a separate data breakpoint register. 
DataCollider uses an inter-processor interrupt to update 
the breakpoints on all processors atomically. This also
synchronizes multiple threads attempting to sample 
different memory locations concurrently. 


An x86 instruction can access variable sized memory. 
For 8, 16, or 32-bit accesses, DataCollider sets a break- 
point of the appropriate size. The x86 processor traps if 
another instruction accesses a memory location that 
overlaps with a given breakpoint. Luckily, this is pre- 
cisely the semantics required for data-race detection. 
For accesses that span more than 32 bits, DataCollider 
uses more than one breakpoint up to the maximum 
available of four. If DataCollider runs out of breakpoint 
registers, it simply resorts to the repeated-read strategy 
discussed below. 
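
The register budget can be sketched as a small helper. It assumes naturally aligned accesses and that one breakpoint register covers at most 32 bits, as described above; the function name and the convention of returning 0 to mean "fall back to repeated reads" are hypothetical.

```c
#include <assert.h>

#define MAX_DATA_BREAKPOINTS 4  /* x86 provides four data breakpoint registers */

/* Returns how many breakpoints are needed to cover an access of
 * `size_bytes` bytes, or 0 if the access cannot be covered with the
 * `free_regs` registers available and repeated reads must be used. */
unsigned breakpoints_needed(unsigned size_bytes, unsigned free_regs)
{
    unsigned needed = (size_bytes + 3) / 4;  /* one register per 4 bytes */
    if (needed == 0 || needed > free_regs || needed > MAX_DATA_BREAKPOINTS)
        return 0;
    return needed;
}
```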


When a data breakpoint fires, DataCollider has success- 
fully detected a race. More importantly, it has caught 
the racing threads “red-handed”: the two threads are at
the point of executing conflicting accesses to the same 
memory location. 


One particular shortcoming of data breakpoint support 
in x86 that we had to work around was the fact that, 
when paging is enabled, x86 performs the breakpoint 
comparisons based on the virtual address and has no 
mechanism to modify this behavior. Two concurrent 
accesses to the same virtual addresses but different 
physical addresses do not race. In Windows, most of 
the kernel resides in the same address space, with two
exceptions. 


Kernel threads accessing the user address space cannot 
conflict if the threads are executing in the context of 
different processes. If a sampled access lies in the user 
address space, DataCollider does not use breakpoints 
and defaults to the repeated-read strategy. 


Similarly, a range of kernel-address space, called ses- 
sion memory, is mapped to different address spaces 
based on the session the process belongs to. When a 
sampled access lies in the session memory space, 
DataCollider sets a data breakpoint but checks if the 
conflicting accesses belong to the same session before 
reporting the conflict to the user. 


Finally, a data breakpoint will miss conflicts if a pro- 
cessor uses a different virtual address mapped to the 
same physical address as the sampled access. Similarly, 
data breakpoints cannot detect conflicts arising from 
hardware devices directly accessing memory. The re- 
peated-read strategy discussed below covers all these 
cases. 


3.2.2. Detecting Conflicts with Repeated 
Reads 


The repeated-read strategy relies on a simple insight: if 
a conflicting write changes the value of a memory loca- 
tion, DataCollider can detect this by repeatedly reading 
the memory location and checking for value changes. An
obvious disadvantage of this approach is that it cannot 
detect conflicting reads. Similarly, it cannot detect mul- 
tiple conflicting writes the last of which writes the same 
value as the initial value. Despite these shortcomings, 
we have found this strategy to be very useful in prac- 
tice. This is the first strategy we implemented (as it is 
easier to implement than using data breakpoints) and 
we were able to find several kernel bugs with this ap- 
proach. 
However, the repeated-read strategy catches only one of
the two threads “red-handed.” This makes it harder to 
debug data races, as one does not know which thread or 
device was responsible for the conflicting write. This 
was our prime motivation for using data breakpoints. 
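
A minimal sketch of the repeated-read check follows. It assumes a 32-bit location and models the pause between reads with a caller-supplied callback; `fake_writer` exists only to exercise the sketch, and every name here is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Takes a snapshot of *addr, then re-reads it up to max_polls times.
 * Returns true if a different value is observed, which indicates that
 * some other agent wrote the location in the meantime. */
bool repeated_read_detect(const volatile uint32_t *addr, unsigned max_polls,
                          void (*between_polls)(void *), void *ctx)
{
    uint32_t snapshot = *addr;
    for (unsigned i = 0; i < max_polls; i++) {
        if (between_polls) between_polls(ctx);  /* stand-in for the delay */
        if (*addr != snapshot) return true;
    }
    return false;
}

/* Stand-in for a conflicting writer: on its fire_at-th call it stores
 * `value` into `target`. */
typedef struct {
    volatile uint32_t *target;
    uint32_t value;
    unsigned fire_at;
    unsigned calls;
} fake_writer;

void fake_writer_step(void *ctx)
{
    fake_writer *w = ctx;
    if (++w->calls == w->fire_at) *w->target = w->value;
}
```

A writer that stores the value already present goes unnoticed, which is the limitation noted above, and the check reveals nothing about who performed the write.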


3.2.3. Inserting Delays 


For a sampled memory access, DataCollider attempts to 
detect a conflicting access to the same memory location 
by delaying the thread for a short amount of time. For 
DataCollider to be successful, this delay has to be long 
enough for the conflicting access to occur. On the other 
hand, delaying the thread for too long can be dangerous 
especially if the thread holds some resource crucial for 
the proper functioning of the entire system. In general, 
it is impossible to predict how long the delay should be.
After experimenting with many values, we chose the 
following delay algorithm. 


Depending on the IRQL (Interrupt Request Level) of 
the executing thread, DataCollider delays the thread for 
a preset maximum amount of time. At IRQLs higher 
than the DISPATCH level (the level at which the kernel 
scheduler operates), DataCollider does not insert any 
delay. We considered inserting a small window of delay 
at this level to identify possible data races between in- 
terrupt service routines. But we did not expect that 
DataCollider would be effective at short delays. 


Threads running at the DISPATCH level cannot yield 
the processor to another thread. As such, the delay is 
simply a busy loop. We currently delay threads at this 
level for a random amount of time less than 1 ms. For
lower IRQLs, DataCollider delays the thread for a max- 
imum of 15 ms by spinning in a loop that yields the 
current time quantum. During this loop, the thread re- 
peatedly checks to see if other threads are making pro- 
gress by inspecting the rate at which breakpoints fire. If 
progress is not detected, the waiting thread prematurely 
stops its wait. 
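
The IRQL-dependent policy can be summarized as a lookup. In this sketch the `delay_plan` type and function name are hypothetical; the value 2 for the DISPATCH level matches `DISPATCH_LEVEL` on Windows, and the bounds are the 1 ms and 15 ms figures given above, expressed in microseconds.

```c
#include <assert.h>

typedef enum { DELAY_NONE, DELAY_BUSY_LOOP, DELAY_YIELD_LOOP } delay_mode;

typedef struct {
    delay_mode mode;
    unsigned   max_us;  /* upper bound on the delay, in microseconds */
} delay_plan;

#define DISPATCH_IRQL 2  /* DISPATCH_LEVEL on Windows */

/* Chooses how, and for at most how long, to delay a sampled access. */
delay_plan plan_delay(unsigned irql)
{
    delay_plan p;
    if (irql > DISPATCH_IRQL) {
        p.mode = DELAY_NONE;        /* above DISPATCH: never delay */
        p.max_us = 0;
    } else if (irql == DISPATCH_IRQL) {
        p.mode = DELAY_BUSY_LOOP;   /* cannot yield: spin for < 1 ms */
        p.max_us = 1000;
    } else {
        p.mode = DELAY_YIELD_LOOP;  /* may yield: wait up to 15 ms */
        p.max_us = 15000;
    }
    return p;
}
```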


3.3. Dealing with Benign Data Races 


Research on data-race detection has amply noted the 
fact that not all data races are erroneous. A practical 
data-race detection tool should effectively prune or 
deprioritize these benign data races when reporting to 
the user. However, inferring whether or not a data race 
is benign can be tricky and might require deep under- 
standing of the program. For instance, a data race be- 
tween two concurrent non-atomic counter updates 
might be benign if the counter is a statistic variable 
whose fidelity is not important to the behavior of the 
program. However, if the counter is used to maintain 
the number of references to a shared object, then the
data race could lead to a memory leak or a premature 
free of the object. 


During the initial runs of the tool, we found that around 
90% of the data-race reports are benign. Inspecting these,
we identified the following patterns that can be recognized
through simple static and/or dynamic analysis and
incorporated them in a post-process pruning phase. 


Statistics Counters: Around half of the benign data 
races involved conflicting updates to counters that 
maintain various statistics about the program behavior 
[21]. These counters are not necessarily write-only and 
could affect the control flow of the program. A common
scenario is to use these counter values to perform
periodic computation such as flushing a log buffer. If 
DataCollider reports several data races involving an 
increment instruction and the value of the memory loca- 
tion consistently increases across these reports, then the 
pruning phase tags these data races as statistics-counter 
races. Checking for an increase in memory values helps 
the pruning phase in distinguishing these statistics 
counters from reference counters that are usually both 
incremented and decremented. 
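
The monotonicity check might be sketched as below. The text says only that the value "consistently increases" across "several" reports; treating this as strictly increasing and requiring at least two reports are assumptions of this example, as is the function name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* values[i] is the content of the racing location observed in the i-th
 * report involving the same increment instruction. Returns true if the
 * observations are consistent with a counter that is only incremented. */
bool looks_like_statistics_counter(const uint64_t *values, size_t n)
{
    if (n < 2) return false;  /* assumed minimum number of reports */
    for (size_t i = 1; i < n; i++)
        if (values[i] <= values[i - 1]) return false;
    return true;
}
```

A reference counter, which is both incremented and decremented, would be expected to fail this test once a decrease is observed.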


Safe Flag Updates: The next prominent class of benign 
races involves a thread reading a flag bit in a memory 
location while another thread updates a different bit in 
the same memory location. By analyzing a few memory
instructions before and after the memory access, the 
pruning phase identifies read-write conflicts that in- 
volve different bits. On the other hand, write-write con- 
flicts can result in lost updates (as shown in Figure 1) 
and are not tagged as benign. 
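
Once the bits touched by each access are known, the classification is a mask test. The sketch assumes that the masks have already been recovered by analyzing the neighboring instructions; the names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { CONFLICT_READ_WRITE, CONFLICT_WRITE_WRITE } conflict_kind;

/* read_mask: bits the reader actually inspects.
 * write_mask: bits the writer actually changes.
 * Only read-write conflicts on disjoint bits are tagged as benign;
 * write-write conflicts can lose updates and are never tagged. */
bool is_safe_flag_update(conflict_kind kind,
                         uint32_t read_mask, uint32_t write_mask)
{
    if (kind == CONFLICT_WRITE_WRITE) return false;
    return (read_mask & write_mask) == 0;
}
```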


Special Variables: Some of the data races reported by 
DataCollider involve special variables in the kernel 
where races are expected. For instance, Windows main- 
tains the current time in a variable, which is read by 
many threads while being updated by the timer inter- 
rupt. The pruning phase has a database of such varia- 
bles and prunes races involving these variables. 


While it is possible to design other patterns that identify 
benign data races, one has to trade off the benefit of the
pruning achieved against the risk of missing real data races.
For instance, we initially designed a pattern to classify
two writes that write the same value as benign.
However, very few data-race reports matched this prop- 
erty. On the other hand, Figure 4 shows an example of a 
harmful data-race that we found involving two such 
writes. 


Also, we have made an explicit decision to make the
benign data races available to the user, but deprioritized
against races that are less likely to be benign. Some of
our users are interested in browsing through the pruned
benign races to identify potential portability problems
and memory-model issues in their code. We also found
an instance where a benign race, despite being harmless,
indicated unintended sharing in the code and
resulted in a design change.


Figure 3: Bugs reported to the developers after
excluding benign data-race reports.


4. Evaluation 


There are two metrics for measuring the success of a 
data-race detection tool. First, is it able to find data rac- 
es that programmers deem important enough to fix? 
Second, is it able to scale to a large system, which in 
our case is the Windows operating system, with reason- 
able runtime overheads? This section presents a case for 
an affirmative claim on these two metrics. 


4.1. Experimental Setup 


For the discussion in this section, we applied DataCollider
to several modules in the Windows operating system.
DataCollider has been used on class drivers,
various PnP drivers, local and remote file system
drivers, storage drivers, and the core kernel executive 
itself. We are able to successfully boot the operating
system with DataCollider and run existing kernel stress 
tests. 


4.2. Bugs Found 


Figure 3 presents the data race reports produced by the 
different versions of DataCollider during its entire development.
We reported a total of 38 data-race reports to
the developers. This figure does not reflect the number 
of benign data races pruned heuristically and manually. 
We defer the discussion of benign data races to Section 
4.4. 


Of these 38 reports, 25 have been confirmed as bugs,
12 of which have already been fixed. The developers
indicated that 5 of the 38 reports are indeed harmless. For
instance, one of the benign data races results in a driver 
issuing an idempotent request to the device. While this 
could result in a performance loss, the expected fre- 
quency of the data race did not justify the cost of add- 
ing synchronization in the common case. Identifying 
such benign races requires intimate knowledge of the 
code and would not be possible without the programmers'
help.


As DataCollider naturally delays the racing access that 
temporally occurs first, it is likely to explore both out- 
comes of the race. Despite this, only one of the 38 data 
races crashed the kernel in our experiments. This indi- 
cates that the effects of an erroneous data race are not 
immediately apparent for the particular input or the 
hardware configuration of the current run. 


We discuss two interesting error reports below.


4.2.1. A Boot Hang Caused by a Data Race 


A hardware vendor was consistently seeing a kernel 
hang at boot-up time. This was not reproducible in any 
of the in-house machine configurations, until the vendor
actually shipped the hardware to the developers. After 
inspecting the hang, a developer noticed a memory cor- 
ruption in a driver that could be a result of a race condi- 
tion. When analyzing the driver in question, DataCol- 
lider found the data race in an hour of testing on a regu- 
lar in-house machine (in which the kernel did not hang). 
Once the source of the corruption was found (perform- 
ing a status update non-atomically), the bug was imme- 
diately fixed. 


void AddToCache() {
    // ...
 A: x &= ~(FLAG_NOT_DELETED);
 B: x |= FLAG_CACHED;
    MemoryBarrier();
    // ...
}

AddToCache();
assert( x & FLAG_CACHED );





Figure 4: An erroneous data race when the 
AddToCache function is called concurrently. 
Though the data race appears benign, as the con- 
flicting accesses “write the same values,” the as- 
sert can fail on some thread schedules. 


4.2.2. A Not-So-Benign Data Race 


Figure 4 shows an erroneous data race. The function 
AddToCache performs two non-atomic updates to the 
flag variable. DataCollider produced an error report 
with two threads simultaneously updating the flag at 
location B. Usually, two instructions writing the same 
values is a good hint that the data race is benign. How- 
ever, the presence of the memory barrier indicated that 
this report required further attention — the developer 
was well aware of the consequences of concurrency and the
rest of the code relied on crucial invariants on the flag
updates. When we reported this data race to the developer,
he initially tagged it as benign. On further discussion,
we discovered that the code relied on the invariant
that the CACHED bit is set after a call to AddToCache.
The data race can break this invariant when a concurrent
thread overwrites the CACHED bit when performing the
update at A, but gets preempted before setting the bit at 
B. 
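
The failing schedule described above can be replayed deterministically in a single thread. The following is a minimal sketch, not the kernel code: the flag values and the function name replay_bad_schedule are assumptions made for illustration, and each "x op= v" of the second thread is split by hand into its load and store halves.

```c
#include <assert.h>

/* Hypothetical bit assignments; the paper does not give the real values. */
#define FLAG_NOT_DELETED 0x1u
#define FLAG_CACHED      0x2u

/* Replays one bad schedule of two threads running AddToCache() on a
 * shared flag word x. Returns the value thread 1 sees right after its
 * own call to AddToCache() has returned. */
unsigned replay_bad_schedule(unsigned x)
{
    unsigned t2_tmp;

    /* Thread 2, statement A: the load half of "x &= ~FLAG_NOT_DELETED". */
    t2_tmp = x;

    /* Thread 1 runs AddToCache() to completion. */
    x &= ~FLAG_NOT_DELETED;          /* A */
    x |= FLAG_CACHED;                /* B */

    /* Thread 2, statement A: the store half, computed from the stale load.
     * This overwrites the CACHED bit that thread 1 has just set. */
    x = t2_tmp & ~FLAG_NOT_DELETED;

    /* Thread 2 is preempted here, before reaching B. Thread 1 now
     * evaluates its assertion against this value. */
    return x;
}
```

With x initially holding only FLAG_NOT_DELETED, the returned word has the CACHED bit clear, which is exactly the state in which the assertion of Figure 4 fails.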








4.2.3. How Fixed 


Figure 5: Runtime overhead of DataCollider with
increasing sampling rate, measured in terms of the
number of code breakpoints firing per second. The
overhead tends to zero as the sampling rate is reduced,
indicating that the tool has negligible base overhead.

Figure 6: The number of data races, uniquely identified
by the pair of racing program locations, with the
runtime overhead. DataCollider is able to report data
races even at overheads under 5%.


While data races can be hard to find and result in
mysterious crashes, our experience is that most are
relatively easy to fix. Of the 12 bugs, 3 were the result of
missing locks. The developer could easily identify the
locking discipline that was meant to be followed, and could
decide which lock to add without the fear of a deadlock.
6 data races were fixed by using atomic instructions,
such as interlocked increment, to make a
read-modify-write to a shared variable atomic. 2 bugs were a
result of unintended sharing and were fixed by making the
particular variable thread local. Finally, one bug
indicated a broken design due to a recent refactoring and
resulted in a design change.
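
As an illustration of the atomic-instruction style of fix, the sketch below uses C11 atomics and POSIX threads as portable stand-ins for the Windows interlocked operations mentioned above. It is an assumption-laden example, not the actual kernel patch: the flag values, thread count, and function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical bit assignments and thread count. */
#define FLAG_NOT_DELETED 0x1u
#define FLAG_CACHED      0x2u
#define NTHREADS         4
#define ITERATIONS       10000

static atomic_uint flags;     /* shared flag word */
static atomic_uint counter;   /* shared counter   */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        /* Each read-modify-write is a single indivisible operation, so
         * no other thread can store between the load and the store. */
        atomic_fetch_and(&flags, ~FLAG_NOT_DELETED);
        atomic_fetch_or(&flags, FLAG_CACHED);
        atomic_fetch_add(&counter, 1u);
    }
    return NULL;
}

/* Runs the workers concurrently and returns the final counter value. */
unsigned run_workers(void)
{
    pthread_t t[NTHREADS];

    atomic_store(&flags, FLAG_NOT_DELETED);
    atomic_store(&counter, 0u);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&counter);
}
```

Because no operation in this sketch ever clears FLAG_CACHED and every update is atomic, the bit stays set once any thread has set it, and no increment is lost.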


4.3. Runtime Overhead 


Users have an inherent aversion to dynamic analysis
tools that add prohibitive runtime overheads. The
obvious reason is the associated wastage of test resources:
a slowdown of ten means that only one-tenth the
amount of testing can be done with a given amount of
resources. More importantly, runtime overheads
introduced by a tool can affect the real-time execution of the
program. The operating system could start a recovery
action if a device interrupt takes too long to finish. Or a
test harness can incorrectly tag a kernel build as faulty if it
takes too long to boot.

Figure 7: Categorization of data races found by
DataCollider during kernel stress.


To measure the runtime overhead of DataCollider, we 
repeatedly measured the time taken for the boot- 
shutdown sequence for different sampling rates and 
compared against a baseline Windows kernel running 
without DataCollider. These experiments were done
on the x86 version of Windows 7 running on a virtual 
machine with 2 processors and 512 MB memory. The 
host machine is an Intel Core2-Quad 2.4 GHz machine 
with 4 GB memory running Windows Server 2008. 
The guest machine was limited to 50% of the pro- 
cessing resources of the host. This was done to prevent 
any background activity on the host from perturbing the 
performance of the guest. 


Figure 5 shows the runtime overhead of DataCollider 
for different sampling rates, measured by the average 
number of code breakpoints fired per second during the 
run. As expected, the overhead increases roughly line- 
arly with the sampling rate. More interestingly, as the 
sampling rate tends to zero, DataCollider’s overhead 
reaches zero. This indicates that DataCollider can be 
“always on” in various testing and deployment scenari- 
os, allowing the user to tune the overhead to any ac- 
ceptable limit. 


Figure 6 shows the number of data races detected for 
different runtime costs. DataCollider is able to detect 
data races even for overheads less than 5%, indicating
the utility of the tool at low overheads. 


4.4. Benign Data Races 


Finally, we performed an experiment to measure the 
efficacy of our pruning algorithm for benign data races. 
The results are shown in Figure 7. We enabled 
DataCollider while running kernel stress tests for 2 
hours sampling at approximately 1000 code breakpoints 
per second. DataCollider found a total of 113 unique 
data races. The patterns described in Section 3.3 can
identify 86 (76%) of these as benign. We manually
(and painfully) triaged these reports to ensure that
these races were truly benign. Of the remaining 27 races,
we manually identified 18 as not erroneous. 8 of them
involved the double-checked locking idiom, where a
thread performs a racy read of a flag without holding a
lock, but reconfirms the value after acquiring the lock.
8 were accesses to volatile variables whose type
DataCollider’s analysis was unable to infer. These reports
can be avoided with a more sophisticated analysis for
determining the program types. This categorization demonstrates
that a significant percentage of benign data races can be
heuristically pruned without the risk of missing real data
races. During this process, we found 9 potentially
harmful data races, of which 5 have already been confirmed
as bugs.
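
For readers unfamiliar with the double-checked locking idiom mentioned above, the following is a minimal sketch in C11 with POSIX threads. All names are hypothetical. In the kernel code the first check is a plain racy read, which is what a race detector reports; here it is written as an acquire load so that the example is well defined under the C11 memory model.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int initialized;   /* the flag that is read without the lock */
static int init_count;           /* protected by init_lock */

void ensure_initialized(void)
{
    /* First check: no lock held. */
    if (atomic_load_explicit(&initialized, memory_order_acquire))
        return;

    pthread_mutex_lock(&init_lock);
    /* Second check: reconfirm the flag now that the lock is held. */
    if (!atomic_load_explicit(&initialized, memory_order_relaxed)) {
        init_count++;            /* stands in for the real one-time setup */
        atomic_store_explicit(&initialized, 1, memory_order_release);
    }
    pthread_mutex_unlock(&init_lock);
}

static void *caller(void *arg)
{
    (void)arg;
    ensure_initialized();
    return NULL;
}

/* Starts up to 16 concurrent callers; returns how often setup ran. */
int run_callers(int n)
{
    pthread_t t[16];

    if (n > 16)
        n = 16;
    for (int i = 0; i < n; i++)
        pthread_create(&t[i], NULL, caller, NULL);
    for (int i = 0; i < n; i++)
        pthread_join(t[i], NULL);
    return init_count;
}
```

The unlocked read and the locked write conflict, yet the recheck under the lock means the setup runs only once, which is why such reports are classified as not erroneous.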


5. Conclusion 


This paper describes DataCollider, a lightweight and 
effective data-race detector specifically designed for 
low-level systems code. Using our implementation of 
DataCollider for the Windows operating system, we 
have found to date 25 erroneous data races of which 12 
are already fixed. 


We would like to thank our shepherd Junfeng Yang and 
all our anonymous reviewers for valuable feedback on 
the paper. 
Abstract


Many synchronizations in existing multi-threaded pro- 
grams are implemented in an ad hoc way. The first part 
of this paper does a comprehensive characteristic study 
of ad hoc synchronizations in concurrent programs. By 
studying 229 ad hoc synchronizations in 12 programs of 
various types (server, desktop and scientific), including 
Apache, MySQL, Mozilla, etc., we find several interest- 
ing and perhaps alarming characteristics: (1) Every stud- 
ied application uses ad hoc synchronizations. Specifically, 
there are 6-83 ad hoc synchronizations in each program.
(2) Ad hoc synchronizations are error-prone. Significant 
percentages (22-67%) of these ad hoc synchronizations 
introduced bugs or severe performance issues. (3) Ad hoc 
synchronization implementations are diverse and many of 
them cannot be easily recognized as synchronizations, i.e. 
have poor readability and maintainability. 

The second part of our work builds a tool called 
SyncFinder to automatically identify and annotate ad hoc 
synchronizations in concurrent programs written in C/C++ 
to assist programmers in porting their code to better struc- 
tured implementations, while also enabling other tools 
to recognize them as synchronizations. Our evaluation 
using 25 concurrent programs shows that, on average, 
SyncFinder can automatically identify 96% of ad hoc syn- 
chronizations with 6% false positives. 

We also build two use cases to leverage SyncFinder’s 
auto-annotation. The first one uses annotation to detect 5 
deadlocks (including 2 new ones) and 16 potential issues 
missed by previous analysis tools in Apache, MySQL and 
Mozilla. The second use case reduces Valgrind data race 
checker’s false positive rates by 43-86%. 


1 Introduction 


Synchronization plays an important role in concurrent pro- 
grams. Recently, partially due to realization of multi- 
core processors, much work has been conducted on syn- 
chronization in concurrent programs. For example, vari- 
ous hardware/software designs and implementations have 
been proposed for transactional memory (TM) [37, 13, 30, 
40] as ways to replace the cumbersome “lock” operations. 
Similar to TM, some new language constructs [46, 7, 12] 
such as Atomizer [12] have also been proposed to address 
the atomicity problem. On a different but related note,
various tools such as AVIO [27], CHESS [31], CTrig- 
ger [36], ConTest [6] have been built to detect or ex- 
pose atomicity violations and data races in concurrent pro- 
grams. In addition to atomicity synchronization, condition 
variables and monitor mechanisms have also been studied 
and used to ensure certain execution order among multiple 
threads [14, 16, 22]. 

So far, most of the existing work has targeted only the 
synchronizations implemented in a modularized way, i.e., 
directly calling some primitives such as “lock/unlock” and 
“cond_wait/cond_signal” from standard POSIX thread li- 
braries or using customized interfaces implemented by 
programmers themselves. Such synchronization methods 
are easy to recognize by programmers, or bug detection 
and performance profiling tools. 

Unfortunately, besides modularized synchronizations,
programmers also use their own ad hoc ways to do
synchronizations. It is usually hard to tell ad hoc
synchronizations apart from ordinary thread-local computations,
making them difficult to recognize, both for other programmers
during maintenance and for bug detection and performance
profiling tools. We refer to such synchronization as ad hoc
synchronization. If a program defines its own synchronization
primitives as function calls and then uses these functions
throughout the program for synchronization, then we do
not consider these primitives as ad hoc, since they are well
modularized.

Ad hoc synchronization is often used to ensure an
intended execution order of certain operations. Specifically,
instead of calling “cond_wait()” and “cond_signal()”
or other synchronization primitives, programmers often
use ad hoc loops to synchronize with some shared
variables, referred to as sync variables. According to
programmers’ comments, they are implemented this way due
to either flexibility or performance reasons.
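
The contrast between the two styles can be sketched as follows. This is an illustrative example under assumptions, not code from any studied application: the variable and function names are hypothetical, and the ad hoc loop uses a C11 atomic load so that the example itself is free of data races.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int ready_flag;     /* sync variable for the ad hoc loop */

static pthread_mutex_t ready_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready_cond  = PTHREAD_COND_INITIALIZER;
static int ready;                 /* predicate guarded by ready_mutex */

/* Ad hoc style: spin on a shared variable until another thread sets it.
 * Nothing in the code marks this loop as a synchronization. */
void wait_ad_hoc(void)
{
    while (!atomic_load(&ready_flag))
        ;   /* busy wait */
}

/* Modularized style: the same ordering, expressed with a primitive
 * that programmers and tools can recognize. */
void wait_cond(void)
{
    pthread_mutex_lock(&ready_mutex);
    while (!ready)
        pthread_cond_wait(&ready_cond, &ready_mutex);
    pthread_mutex_unlock(&ready_mutex);
}

static void *producer(void *arg)
{
    (void)arg;
    atomic_store(&ready_flag, 1);          /* releases wait_ad_hoc() */
    pthread_mutex_lock(&ready_mutex);
    ready = 1;
    pthread_cond_signal(&ready_cond);      /* releases wait_cond() */
    pthread_mutex_unlock(&ready_mutex);
    return NULL;
}

/* Waits on both mechanisms; returns 2 once both have been signalled. */
int run_both(void)
{
    pthread_t t;

    pthread_create(&t, NULL, producer, NULL);
    wait_ad_hoc();
    wait_cond();
    pthread_join(t, NULL);
    return atomic_load(&ready_flag) + ready;
}
```

Both waits enforce the same order, but only the second one is visible to a tool that looks for calls to known synchronization primitives.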

Figure 1(a)(b)(c)(d) show four real world examples 
of ad hoc synchronizations from MySQL, Mozilla, and 
OpenLDAP. In each example, a thread is waiting for some 
other threads by repetitively checking on one or more 
shared variables, i.e. sync variables. Each case has its own
specific implementation, and it is not obvious
that a thread is synchronizing with another thread.

Unfortunately, there have been few studies on ad hoc
synchronization. It is unclear how commonly it is used,
how programmers implement it, what issues are associated
with it, and whether it is error-prone or not.


(a) direct spinning

    /* “wait for the other guy finish
       (not efficient, but rare)” */
    while (crc_table_empty);
    write_table(.., crc_table[0]);
    /* MySQL */

(b) multiple exit conditions

    /* wait until some waiting threads enter */
    while(group->waiter->count == 0) {
        /* abort if the group is not running */
        if(group->state != _prmw_running) {
            PR_SetError(..);
            goto aborted;
        }
    } /* Mozilla */

(c) control dependency

    for (deleted=0; ;) {
        THREAD_LOCK(..., dbmp->mutex);
        /* wait for other threads to release their
           references to dbmfp */
        if (dbmfp->ref == 1) {
            if (F_ISSET(dbmfp, MP_OPEN_CALLED))
                TAILQ_REMOVE(&dbmp->dbmfq, ..);
            deleted = 1;
        }
        THREAD_UNLOCK(..., dbmp->mutex);
        if (deleted) break;
        __os_sleep(dbenv, 1, 0);
    } /* OpenLDAP */

(d) useful work inside waiting loop

    /* wait for operations on tables from other threads */
    new_activity_counter = 0;
    background_loop:
        tables_to_drop = drop_tables_in_background();
        if(tables_to_drop > 0)
            os_thread_sleep(100000);
        while(n_pages_purged) {
            log_buffer_flush_to_disk();
        }
        /* new activities come in, go active and serve */
        if(new_activity_counter > 0)
            goto loop;
        else goto background_loop;
    /* MySQL */

Figure 1: Real world examples of ad hoc synchronizations. Sync variables are highlighted using bold fonts. Example (a) directly
spins on the sync variable; (b) checks more than one sync variable; (c) takes a certain control path to exit after checking a sync
variable; (d) performs some useful work inside the waiting loop.


1.1 Contribution 1: Ad Hoc Synchronization Study


In the first part of our work, we conduct a “forensic inves- 
tigation” of 229 ad hoc synchronizations in 12 concurrent 
programs of various types (server, desktop and scientific), 
including Apache, MySQL, Mozilla, OpenLDAP, etc. The 
goal of our study is to understand the characteristics and 
implications of ad hoc synchronization in existing concur- 
rent programs. 

Our study has revealed several interesting, alarming and 
quantitative characteristics as follows: 


(1) Every studied concurrent program uses ad hoc syn- 
chronization. More specifically, there are 6-83 ad hoc 
synchronizations implemented using ad hoc loops in each 
of the 12 studied programs. The fact that programmers 
often use ad hoc synchronization is likely due to two pri- 
mary reasons: (i) Unlike typical atomicity synchroniza- 
tion, when coordinating execution order among threads, 
the intended synchronization scenario may vary from one 
to another, making it hard to use a common interface 
to fit every need (more discussion follows below and in 
Section 2); (ii) Performance concerns make some of the 
heavy-weight synchronization primitives less applicable. 


(2) Although almost all ad hoc synchronizations are
implemented using loops, the implementations are diverse,
making it hard to manually identify them among the
thousands of computation loops. For example, Figure 1(a)
directly spins on a shared variable; Figure 1(b) has
multiple exit conditions; Figure 1(c) shows the exit condition
indirectly depends on the sync variable and needs
complicated calculation to determine whether to exit the loop;
Figure 1(d) synchronizes on program states and performs
useful work while checking whether the remote thread has
changed the states or not. Such characteristics may
partially explain why programmers use ad hoc
synchronizations. More discussion and examples are in Section 2.


Apps.           #ad hoc sync    #buggy sync
Apache          33              7 (22%)
OpenLDAP        15              10 (67%)
Cherokee        6               3 (50%)
Mozilla-js      17              5 (30%)
Transmission    13              8 (62%)

Table 1: Percentages of ad hoc synchronizations that had
introduced bugs according to the bugzilla databases and
changelogs of the applications.


(3) Ad hoc synchronizations are error-prone. Table 1
shows that among the five software systems we studied,
significant percentages (22-67%) of ad hoc
synchronizations introduced bugs. Although some experts may expect
such results, our study is among the first to provide
quantitative results to back up this observation.

Ad hoc synchronization can easily introduce deadlocks
or hangs. As shown in Figure 2, Apache had a deadlock in
one of its ad hoc synchronizations. It holds a mutex while
waiting on a sync variable “queue_info->idlers”. Figure 3
shows another deadlock example in MySQL, which has
never been reported previously. More details and the real
world examples are in Section 2.

Because they are different from deadlocks caused by 
locks or other synchronization primitives, deadlocks in- 
volving ad hoc synchronizations are very hard to detect 
using existing tools or model checkers [11, 43, 24]. These 
tools cannot recognize ad hoc synchronizations unless 
these synchronizations are annotated manually by pro- 
grammers or automatically by our SyncFinder described 
in Section 1.2. For the same reason, it is also hard for
concurrency testing tools such as ConTest [6] to expose these
deadlock bugs during testing. 

listener thread:
    apr_thread_mutex_lock(&m);
    while(!ring_empty(..)
          && expiration_time<timeout
          && get_worker(&idle_worker)){
    }

    get_worker(..){
        while(queue_info->idlers==0);
    } /* Apache */

worker thread:
    apr_thread_mutex_lock(&m);
    apr_atomic_inc32(queue_info->idlers);

change log: “Never hold mutex while calling blocking operations”

Figure 2: A deadlock introduced by an ad hoc
synchronization in Apache.


Thread 1
    S1.1 pthread_mutex_lock(&mutex);
    S1.2 while(global_read_block) {...}
    S1.3 pthread_mutex_unlock(&mutex);
    Hold : mutex
    Wait : global_read_block (thread 3)

Thread 2
    S2.1 protect_global_read ++;
    S2.2 pthread_mutex_lock(&mutex);
    S2.3 protect_global_read --;
    Hold : protect_global_read
    Wait : mutex (thread 1)

Thread 3
    S3.1 global_read_block ++;
    S3.2 while(protect_global_read > 0) {...}
    S3.3 global_read_block --;
    Hold : global_read_block
    Wait : protect_global_read (thread 2)

/* MySQL */

Figure 3: A deadlock caused by a circular wait among
three threads (This is a new deadlock detected by our
deadlock detector leveraging SyncFinder’s auto-annotation).
Thread 2 is waiting at S2.2 for the lock to be released by
thread 1; thread 1 is waiting at S1.2 for thread 3 to decrease
the counter at S3.3; and thread 3 is waiting at S3.2 for thread 2
to decrease another counter at S2.3.


Furthermore, ad hoc synchronizations also have
problems interacting with modern hardware’s weak memory
consistency model and also with some compiler
optimizations, e.g. loop invariant hoisting (discussed further in
Section 2).
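
The compiler-optimization hazard can be sketched as follows. This is an illustration under assumptions, with hypothetical names: wait_plain is a typical ad hoc loop, wait_plain_as_hoisted shows one result a compiler is permitted to produce for it, and wait_atomic is one possible repair using C11 atomics.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Ad hoc wait as often written: nothing tells the compiler that
 * `done` can be changed by another thread while the loop runs. */
int done;

void wait_plain(void)
{
    while (!done)
        ;
}

/* One legal outcome of hoisting the load of `done` out of the loop:
 * the variable is read once, so a later store by another thread is
 * never observed and the thread can spin forever. */
void wait_plain_as_hoisted(void)
{
    int tmp = done;
    if (!tmp)
        for (;;)
            ;
}

/* Possible repair: an atomic acquire load must be performed on every
 * iteration and also orders later reads after the observed store. */
atomic_int done_atomic;

void wait_atomic(void)
{
    while (!atomic_load_explicit(&done_atomic, memory_order_acquire))
        ;
}

static void *setter(void *arg)
{
    (void)arg;
    atomic_store_explicit(&done_atomic, 1, memory_order_release);
    return NULL;
}

/* Spins until a second thread publishes the flag; returns its value. */
int run_wait_atomic(void)
{
    pthread_t t;

    atomic_store(&done_atomic, 0);
    pthread_create(&t, NULL, setter, NULL);
    wait_atomic();
    pthread_join(t, NULL);
    return atomic_load(&done_atomic);
}
```

The release store paired with the acquire load also addresses the hardware side of the problem, since it constrains how the two threads' memory accesses may be reordered around the flag.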

By studying the comments associated with ad hoc syn- 
chronizations, we found that some programmers knew 
their implementations might not be safe or optimal, but 
they still decided to keep their ad hoc implementations. 


(4) Ad hoc synchronizations significantly impact the effec- 
tiveness and accuracy of various bug detection and per- 
formance tuning tools. Since most bug detection tools 
cannot recognize ad hoc synchronizations, they can miss 
many bugs related to those synchronizations, as well as 
introduce many false positives (details and examples in 
Section 2). For the same reason, performance profiling 
and tuning tools may confuse ad hoc synchronizations 
for computation loops, thus generating inaccurate or even 
misleading results. 


1.2 Contribution 2: Identifying Ad Hoc Synchronizations


Our characteristic study on ad hoc synchronization reveals 
that ad hoc synchronization is often harmful with respect 
to software correctness and performance. The first step
to address the issues raised by ad hoc synchronization is
to identify and annotate them, similar to the way that type
annotation helps Deputy [9] and SafeDrive [50] to identify 
memory issues in Linux. Specifically, if ad hoc synchro- 
nizations are annotated in concurrent programs, (1) static 
or dynamic concurrency bug (e.g. data race and deadlock) 
detectors can leverage such annotations to detect more 
bugs and prune more false positives caused by ad hoc syn- 
chronizations; (2) performance tools can be extended to 
capture bottlenecks related to these synchronizations; (3) 
new programming language/model designers can study ad 
hoc synchronizations to design or revise language con- 
structs; (4) programmers can port such ad hoc synchro- 
nizations to more structured implementations. 
Unfortunately, ad hoc synchronizations are very hard 
and time-consuming to recognize and annotate manu- 
ally. Partly because of this, although some annotation 
languages for synchronizations like Sun Microsystems’ 
Lock_Lint [2] have been available for several years, they 
are rarely used, even in Sun’s own code [35]. Further- 
more, manual examination is also error-prone. Figure 4 
shows a MySQL ad hoc synchronization example that we 
missed during the manual identification we conducted for 
our characteristic study. Fortunately, our automatic iden- 
tification tool SyncFinder found it. We overlooked this 
example because of the complicated nested “goto” loops. 





    loop:
        if(shutdown_state > 0)
            goto background_loop;

    background_loop:
        /* background operations */
        if(new_activity_counter > 0)
            goto loop;
        else
            goto background_loop;

        if(shutdown_state == EXIT)
            os_thread_exit(NULL);
        goto loop;
    /* MySQL */

Figure 4: An ad hoc synchronization missed in the manual
identification process of our characteristic study but
identified by our auto-identification tool, SyncFinder. The
interlocked “goto” loops can easily be missed by manual
identification (Figure 1(d) shows more detailed code).


Motivated by the above reasons, the second part of our 
work involved building a tool called SyncFinder to auto- 
matically identify and annotate ad hoc synchronizations 
in concurrent programs. SyncFinder statically analyzes 
source code using inter-procedural, control and data flow 
analysis, and leverages several of our observations and
insights gained from our study to distinguish ad hoc
synchronizations from thousands of computation loops.

We evaluate SyncFinder with 25 concurrent programs 
including the 12 used in our characteristic study and 13 
others. SyncFinder automatically identifies and annotates 
96% of ad hoc synchronization loops with 6% false posi- 
tives on average. 

Apps.               Desc.               LOC     Total loops   Ad hoc loops
Apache 2.2.14       Web server          228K    1462          33
MySQL 5.0.86        Database server     1.0M    4265          83
OpenLDAP 2.4.21     LDAP server         272K    2044          15
Cherokee 0.99.44    Web server          60K     748           6
Mozilla-js 0.9.1    JS engine           214K    848           17
PBZip2 2-1.1.1      Parallel bzip2      3.6K    45            7
Transmission 1.83   BitTorrent client   96K     1114          13
Radiosity           SPLASH-2            14K     80            12
Barnes              SPLASH-2            2.3K    88            7
Water               SPLASH-2            1.5K    84            9
Ocean               SPLASH-2            4.0K    339           20
FFT                 SPLASH-2            1.0K    57            7

Table 2: The number of ad hoc synchronizations in
concurrent programs we studied. Ad hoc sync is implemented with
an ad hoc loop using shared variables (i.e., sync variables) in it.


To demonstrate the benefits of auto-annotation of ad
hoc synchronizations by SyncFinder, we design and
evaluate two use cases. In the first use case, we build a
simple wait-inside-critical-section detector, which can
identify deadlocks and bad programming practices involving ad
hoc synchronizations. In our evaluation, our tool detects
five deadlocks that are missed by previous deadlock
detection tools in Apache, MySQL and Mozilla, and, moreover,
two of the five are new bugs and have never been reported
before. In addition, even though some (16) of the detected
issues are not deadlocks, they are still bad practices and
may introduce some performance issues or future
deadlocks. The synchronization waiting loop inside a critical
section protected by locks can potentially cause cascading
wait effects among threads.

As the second use case, we extend the Valgrind [33] 
data race checker to leverage the ad hoc synchronization 
information annotated by SyncFinder. As a result, Val- 
grind’s false positive rates for data races decrease by 43- 
86%. This indicates that even though SyncFinder is not a 
bug detector itself, it can help concurrency bug detectors 
improve their accuracy by providing ad hoc synchroniza- 
tion information. 


2 Ad Hoc Synchronization Characteristics 
To understand ad hoc synchronization characteristics, we
have manually studied 12 representative applications of
three types (server, desktop and scientific/graphic), as
shown in Table 2. Two inspectors separately investigated
almost every line of source code and compared the results
with each other. As shown in Table 3, in our initial study,
we missed a few ad hoc synchronizations, most of which
are those implemented using interlocked or nested goto
loops (e.g., the example in Figure 4). Fortunately, our
automatic identification tool, SyncFinder, discovered them,
and we were able to extend our manual examination to
include such complicated types.


Apps.       Sync loops   Ia             Ib             Both
Apache      33           (illegible)    (illegible)    (illegible)
MySQL       83           (illegible)    (illegible)    (illegible)
OpenLDAP    15           3              3              2

Table 3: Ad hoc sync loops missed by human
inspections. Two inspectors, Ia and Ib, investigate the same
source code separately. Most of the sync loops missed
by both inspectors (i.e., those in Apache and MySQL) are
interlocked or nested goto loops. Others (in OpenLDAP)
are for-loops doing complicated useful work and checking
synchronization condition in it, like one in Figure 1(d).


Threats to Validity. Similar to previous work,
characteristic studies are all subject to the validity problem.
Potential threats to the validity of our characteristic study are
the representativeness of applications and our examination
methodology. To address the former, we chose a variety of
concurrent programs, including four servers, three
client/desktop concurrent applications as well as five scientific
applications from SPLASH-2, all written in C/C++, one of
the popular languages for concurrent programs. These
applications are representative of server, client/desktop-based
and scientific applications, three large classes of
concurrent programs.

In terms of our examination methodology, we have ex- 
amined almost every line of code including programmers’ 
comments. This was an immensely time consuming effort 
that took three months of our time. To ensure correctness, 
the process was repeated twice, each time by a different 
author. Furthermore, we were also quite familiar with the 
examined applications, since we have modified and used 
them in many of our previously published studies. 

Overall, while we cannot draw any general conclusions 
that can be applied to all concurrent programs, we believe 
that our study does capture the characteristics of synchro- 
nizations in three large important classes of concurrent ap- 
plications written in C/C++. 


Finding 1: Every studied application uses ad hoc syn- 
chronizations. More specifically, there are 6-83 ad hoc 
synchronizations in each of the 12 studied programs. 
As shown in Table 2, ad hoc synchronizations are used in 
all of our evaluated programs, and some programs (e.g. 
MySQL) even use as many as 83 ad hoc synchronizations. 
This indicates that, in the real world, it is not rare for pro- 
grammers to use ad hoc synchronizations in their concur- 
rent programs. 

While we are not 100% sure why programmers use ad 
hoc synchronizations, after studying the code and com- 
ments, we speculate there are two primary reasons. The 
first is because there are diverse synchronization needs to
ensure execution order among threads. Unlike atomicity
synchronization that shares a common goal, the exact
synchronization scenario for enforcing order may vary from
one to another, making it hard to design a common
interface to fit every need (more discussion in Finding 2).

The second reason is due to performance concerns on 
synchronization primitives, especially those heavyweight 
ones implemented as system calls. If the synchronization 
condition can be satisfied quickly, there is no need to pay 
the high overhead of context switches and system calls. 
               Total   Total    Single exit cond. (sc)             Multiple exit cond. (mc)   Total   Total
Apps.          loops   ad hoc   -dir   -df   -cf   -func   total   -all   -Nall   total       func    async
Apache         1462    33       4      0     1     3       8       22     3       25          16      25
MySQL          4265    83       23     5     4     11      43      13     27      40          32      64
OpenLDAP       2044    15       2      0     0     2       4       4      7       11          9       15
Cherokee       748     6        0      2     0     1       3       0      3       3           1       5
Mozilla-js     848     17       2      4     1     4       10      4      1       3           5       15
PBZip2         45      7        0      0     0     1       1       0      6       6           7       7
Transmission   1114    13       6      0     0     1       7       0      6       6           3       2
Radiosity      80      12       5      5     1     0       11      1      0       1           0       1
Barnes         88      7        6      1     0     0       7       0      0       0           0       0
Water          84      9        9      0     0     0       9       0      0       0           0       0
Ocean          339     20       20     0     0     0       20      0      0       0           0       0
FFT            57      7        7      0     0     0       7       0      0       0           0       0

sc: single exit cond.
  -dir: directly depends on sync variable
  -df: has data dependency
  -cf: has control dependency
mc: multiple exit cond.
  -all: all exit conditions depend on sync vars
  -Nall: not all, but at least one does
func: inter-procedural dependency
async: useful work while waiting

Table 4: Diverse ad hoc synchronizations in concurrent programs we studied. (i) The number of exit conditions in synchronization
loops varies (sc vs. mc); (ii) There can be multiple, different types of dependency relations between sync variables and loop
exit conditions (-dir, -df, -cf, -func); (iii) Some synchronization loops do useful work with asynchronous condition checking (async).





(a) sc-df (data dependency)

    while(1) {
        int oldcount = (global->barrier).count;
        ...
        if(updatedcount == oldcount) break;
    } /* SPLASH2 */

(b) mc-Nall (Some are local exit conditions)

    int finished = 0;
    for(i = 0; i < 1000 && !finished; i++) {
        if(global->pbar_count >= n_proc)
            finished = 1;
    } /* Radiosity */

(c) Function call

    /* wait for the next block from
       a producer queue */
    safe_mutex_lock(fifo->mut);
    ...
        if( !queue->empty &&
            queue->getData(Data) )
            break;
    }

    Bool queue::getData(ElemPtr &fileData) {
        ElemPtr &headElem = qData[head];
        ...
        /* search qData to find the requested
           block. If finds out, return true;
           otherwise, return false */
    }
    /* PBZip2 */

Figure 5: Examples of various ad hoc synchronizations. A sync variable is highlighted using a bold font. An arrow shows the
dependency relation from a sync variable to a loop-exit condition. The examples of other ad hoc categories are shown in Figure 1.


Such performance justifications are frequently mentioned 
in programmers’ comments associated with ad hoc syn- 
chronization implementations. 

While ad hoc synchronizations are seemingly justified, are
they really worthwhile? What is their impact on
program correctness and interaction with other tools? Can
they be expressed using some common, easy-to-recognize
synchronization primitives? We will dive into these
questions in Findings 3 and 4, trying to shed some light on
the tradeoffs.


Finding 2: Ad hoc synchronization is diverse. 

Table 4 further categorizes ad hoc synchronizations from
several perspectives. Some real world examples for each
category can be found in Figure 1 and Figure 5.

(i) Single vs. multiple exit conditions: Some ad hoc
synchronization loops have only one exit condition¹. We
call such sync loops sc loops. Unfortunately, many
others (up to 86% of ad hoc synchronizations in a program)
have more than one exit condition. We refer to them as
mc loops. In some of them (referred to as mc-all), all exit
conditions are satisfied by remote threads. In the other
loops (referred to as mc-Nall), there are also some local
exit conditions such as time-outs, etc., that are
independent of remote threads and can be satisfied locally.

¹ A condition that can break the execution out of a loop.


(ii) Dependency on sync variables: The simplest ad hoc
synchronization is just directly spinning on a sync variable
as shown in Figure 1(a). In many other cases (50-100%
of ad hoc synchronizations in a program), exit conditions
indirectly depend on sync variables via data dependencies
(referred to as df, Figure 5(a)), control dependencies
(referred to as cf, Figure 1(c)), or even inter-procedural
dependencies (referred to as func, Figure 5(c)).


(iii) Asynchronous synchronizations (referred to as async):
In some cases (77% of ad hoc synchronizations in 
server/desktop applications we studied), a thread does 
not just wait in synchronization. Instead, it also per- 
forms some useful computations while repetitively check- 
ing sync variables at every iteration. For example, in Fig- 
ure 1(d), a MySQL master thread does background tasks 
like log flushing until a new SQL query arrives (by check- 
ing new_activity_counter). 
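
A loop that combines these traits can be sketched as follows. The example is hypothetical and not taken from any studied application: it has one remote exit condition (a sync variable), one local exit condition (an iteration bound standing in for a time-out), and performs work between checks, so it would be classified as mc-Nall and async. All names are assumptions.

```c
#include <assert.h>
#include <stdatomic.h>

enum wait_result { WAIT_SIGNALLED, WAIT_TIMED_OUT };

/* Hypothetical sync variable, set by some other thread. */
atomic_int new_activity;

/* Hypothetical unit of background work performed while waiting. */
int background_steps;

static void do_background_step(void)
{
    background_steps++;
}

enum wait_result wait_for_activity(int max_iters)
{
    /* Local exit condition: the iteration bound, satisfied locally. */
    for (int i = 0; i < max_iters; i++) {
        /* Remote exit condition: depends on the sync variable. */
        if (atomic_load(&new_activity) > 0)
            return WAIT_SIGNALLED;
        /* Useful work between checks (the async trait). */
        do_background_step();
    }
    return WAIT_TIMED_OUT;
}
```

Because the loop body looks like ordinary computation and one of its exits never involves another thread, nothing in its shape distinguishes it from a regular bounded work loop.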
Original:

    /* get tuple Id of a table */
    do {
        ret = m_skip_auto_increment ?
              readAutoIncrementValue(...) :
              getAutoIncrementValue(...);
    } while(ret == -1 && --retries && ..);

Revised:

    for (;;) {
        if (m_skip_auto_increment &&
            readAutoIncrementValue(...)
            || getAutoIncrementValue(...)) {
            if (--retries && ...) {
                my_sleep(retry_sleep);
                continue;   /* 30 ms sleep
                               for transaction */
            }
        }
        break;
    } /* MySQL */

Figure 6: An ad hoc synchronization in MySQL was revised
by programmers to solve a performance problem.


Finding 3: Ad hoc synchronizations can easily intro- 
duce bugs or performance issues. 

After studying the 5 applications listed in Table 1, we 
found that 22-67% of synchronization loops previously 
introduced bugs or performance issues. These high issue 
rates are alarming, and, as a whole, may be a strong sign 
that programmers should stay away from ad hoc synchro- 
nizations. 

For each ad hoc synchronization loop, we use its corresponding file and
function names to check the source code repository for any patch
associated with it. If there is one, we manually check whether the
patch involves the ad hoc sync loop. We then use this patch’s
information to search the Bugzilla databases and commit logs for all
relevant information. By examining this information as well as the
patch code, we identify whether the patch is a feature addition, a fix
for a bug unrelated to synchronization, or a fix for a bug caused by
the ad hoc sync loop itself. We only count the last case.

Besides deadlocks (as demonstrated in Figures 2 and 3), ad hoc
synchronization can also introduce other types of concurrency bugs. In
some cases, an ad hoc synchronization fails to guarantee an expected
order and leads to a crash because the exit condition can be satisfied
unexpectedly by a third thread. Due to space limitations, we do not
show those examples here.

In addition to bugs, ad hoc synchronizations can also 
introduce performance issues. Figure 6 shows such an ex- 
ample. In this case, the busy wait can waste CPU cycles 
and decrease throughput. Therefore, programmers revised 
the synchronization by adding a sleep inside the loop. 

Ad hoc synchronizations also have problematic interactions with modern
hardware’s relaxed consistency models [5, 28, 45]. Modern
microprocessors can reorder two writes to different locations, so ad
hoc synchronizations such as the one in Figure 1(a) can fail to
guarantee the intended order in some cases. For this reason, experts
recommend that programmers stay away from such ad hoc synchronization
implementations, or at least implement synchronization using atomic
instructions instead of plain reads and writes [5, 28, 45].
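
As an illustration of the "atomic instructions" advice, the following
minimal sketch assumes a C11 toolchain with <stdatomic.h>. The names
payload, ready, publish, and wait_and_read are hypothetical and do not
come from any of the studied applications. The writer stores a payload
and then sets a flag with release ordering; the waiter spins on the
flag with acquire ordering, so the two writes cannot be observed out of
order.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state: a payload guarded by a ready flag. */
static int payload;
static atomic_bool ready;   /* static storage, so it starts as false */

/* Writer side: store the payload, then set the flag.
 * The release store keeps the payload write from being
 * reordered after the flag write. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Waiter side: spin until the flag is set, then read the payload.
 * The acquire load pairs with the release store above. Because the
 * flag is read through an atomic load on every iteration, the
 * compiler cannot hoist the read out of the loop. */
int wait_and_read(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire)) {
        /* busy wait; a real program would likely yield or sleep */
    }
    return payload;
}
```

Note that this is still an ad hoc spinning loop, so the other concerns
raised in this section (wasted CPU cycles, invisibility to tools) still
apply; the sketch only addresses the ordering problem.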

To make things even worse, ad hoc synchronizations also have
problematic interactions with compiler optimizations such as loop
invariant hoisting. Programmers must prevent such optimizations from
being applied to sync variables and ensure that waiting loops always
read up-to-date values rather than values cached in registers. As a
workaround, programmers may need to wrap variable accesses in function
calls [3]. All of this complicates programming as well as software
testing and debugging.

Comment examples

Programmers are aware of better design but still use ad hoc
implementation (8%)
    /* This can be built in smarter way, like pthread_cond,
       but we do it since the status can come from.. */
    /* By doing.. applications will get better performance and
       avoid the problem entirely. Regardless, we do this...
       because we’d rather write error message in this routine, ..*/

Programmers try to prevent bugs in the first place (22%)
    /* We could end up spinning indefinitely with a situation
       where.. The ‘i++’ stops the infinite loop */
    /* We can safely wait here in the case.. without fear of
       deadlock because we made.. */
    /* This spinning actually isn’t necessary except
       when the compiler does corrupt 64bit arithmetic.. */

Programmers explicitly state their sync assumptions (75%)
    /* GC doesn’t set the flag until it has waited for all active
       requests to end */
    /* We must break the wait if one of the following occurs:
       i).. ii).. iii).. iv).. v).. */

Table 5: Observations in programmers’ comments on ad
hoc synchronization from Apache, Mozilla, and MySQL. We
study 63 comments associated with ad hoc synchronizations.

Interestingly, some programmers are aware of the above problems with ad
hoc synchronization but still use it. We studied the 63 comments
associated with ad hoc synchronizations in MySQL, Apache, and Mozilla.
As illustrated in Table 5, programmers sometimes mentioned better
alternatives but still chose their ad hoc implementations for
flexibility. In some cases, they explicitly indicated their preference
for the lightness and simplicity of ad hoc spinning loops, especially
when the synchronizations were expected to occur rarely or to rarely
wait long. Programmers also often stated explicitly in comments their
assumptions and expectations about what remote threads should do,
since ad hoc synchronizations are complex and hard to understand.


Finding 4: Ad hoc synchronizations can significantly 
impact the effectiveness and accuracy of concurrency 
bug detection and performance profiling tools. 

As mentioned earlier, since existing concurrency bug (deadlock, data
race) detection tools cannot recognize ad hoc synchronizations, they
fail to detect bugs that involve such synchronizations (e.g., the
deadlock examples shown in Figures 2 and 3).

In addition, ad hoc synchronizations can also introduce many false
positives. It is well known that most data race detectors report many
false positives due to ad hoc synchronizations. Such false positives
come from two sources: (1) Benign data races on sync variables:
typically an ad hoc synchronization is implemented via an intended data
race on sync variables. Figure 7(a) shows such a benign data race
reported by Valgrind [33] in MySQL. (2) False data races that would
never execute in parallel due to the execution order guaranteed by ad
hoc synchronizations: for example, in Figure 7(b), the two threads are
synchronized at S2 and S3, which guarantees the correct order between
S1’s and S4’s accesses to q_info->pools. S1 and S4 would never race
with each other. However, most data race checkers cannot recognize this
ad hoc synchronization and, as a result, incorrectly report S1 and S4
as a data race.

    Thread 1                        Thread 2
    #define LAST_PHASE 1            #define EXIT_THREADS 3
    loop:
      if (state < LAST_PHASE)       state = EXIT_THREADS;
        goto loop;
                                                    /* MySQL */
(a) Benign data race on state

    Worker                                Listener
    S1  q_info->pools = new_recycle;      S3  while (q_info->idlers == 0) {...}
    S2  atomic_inc(&(q_info->idlers));    S4  first_pool = q_info->pools;
                                                    /* Apache */
(b) False data race on q_info->pools

Figure 7: False positives in Valgrind data race detection due to ad hoc synchronizations.

Synchronization is also a major performance and scalability concern
because time spent waiting at synchronization is wasted. Unfortunately,
existing work on synchronization cost analysis [25, 32] and performance
profiling [29] cannot recognize ad hoc synchronizations, so they can
easily be mistaken for computation. As a result, the final profiling
results may lead programmers to make suboptimal or even incorrect
decisions during performance tuning.


Replacing with synchronization primitives. Our findings above reveal
that ad hoc synchronization is often harmful in several respects.
Therefore, it is desirable that programmers use synchronization
primitives such as cond_wait rather than ad hoc synchronization.
Figure 8 shows how ad hoc synchronization can be replaced with a
well-known synchronization primitive, POSIX pthread_cond_wait(). Note
that it may not always be straightforward to replace all ad hoc
synchronizations with existing synchronization primitives, because
those primitives may not be sufficient to meet the diverse
synchronization needs and the performance requirements discussed in
Finding 1.

(a) MySQL

    Ad hoc:
        /* “wait for the other guy to finish
            (not efficient, but rare)” */
        while (crc_table_empty);
        write_table(out, crc_table[0]);

    Replacement:
        pthread_mutex_lock(&mutex);
        while (crc_table_empty) {
          pthread_cond_wait(&cond_var, &mutex);
        }
        pthread_mutex_unlock(&mutex);
        write_table(.., crc_table[0]);

(b) SPLASH2

    Ad hoc:
        while(1) {
          int oldcount = (global->barrier).count;
          ...
          if(updatedcount == oldcount) break;
        }

    Replacement:
        pthread_mutex_lock(&mutex);
        while(1) {
          int oldcount = (global->barrier).count;
          ...
          if(updatedcount == oldcount) break;
          pthread_cond_wait(&cond_var, &mutex);
        }
        pthread_mutex_unlock(&mutex);

Figure 8: Replacing ad hoc synchronizations with synchronization
primitives using condition variables. (a) shows the
re-implementation of ad hoc synchronization in Figure 1(a); (b)
is for Figure 5(a).
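
Figure 8 shows only the waiting side. The sketch below, which assumes a
POSIX threads environment, adds the signaling side that a condition
variable replacement also needs. All names (table_mutex, table_cond,
table_empty, and the three functions) are hypothetical and are not
taken from MySQL.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical shared state; none of these names come from MySQL. */
static pthread_mutex_t table_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  table_cond  = PTHREAD_COND_INITIALIZER;
static bool table_empty = true;

/* Producer side: mark the table as filled, then wake every waiter.
 * The flag is changed while holding the mutex so that a waiter cannot
 * miss the update between its check and its call to cond_wait. */
void table_mark_filled(void)
{
    pthread_mutex_lock(&table_mutex);
    table_empty = false;
    pthread_cond_broadcast(&table_cond);
    pthread_mutex_unlock(&table_mutex);
}

/* Consumer side: block (instead of spinning) until the table is
 * filled. The while loop re-checks the predicate because
 * pthread_cond_wait may return spuriously. */
void table_wait_filled(void)
{
    pthread_mutex_lock(&table_mutex);
    while (table_empty) {
        pthread_cond_wait(&table_cond, &table_mutex);
    }
    pthread_mutex_unlock(&table_mutex);
}

/* Read the flag under the mutex. */
bool table_is_empty(void)
{
    pthread_mutex_lock(&table_mutex);
    bool empty = table_empty;
    pthread_mutex_unlock(&table_mutex);
    return empty;
}
```

In a real program, one thread would call table_wait_filled() while
another calls table_mark_filled() after building the table.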


3 Ad hoc Synchronization Identification 


3.1 Overview 


As ad hoc synchronizations have raised many challenges 
and issues related to correctness and performance, it would 
be useful to identify and annotate them. Manually doing 
this is tedious and error-prone since they are diverse and 
hard to tell apart from computation. Therefore, the second 
part of our work builds a tool called SyncFinder to auto- 
matically identify and annotate them in the source code of 
concurrent programs. The annotation can be leveraged in 
several ways as discussed in Section 1.2. 

There are two possible approaches to achieving this goal. One is
dynamic and analyzes run-time traces. The other is static and analyzes
source code. Even though the dynamic approach has more accurate
information than the static one, it can incur large (up to 30X [27])
run-time overhead to collect memory access traces. In addition, the
number of ad hoc synchronizations it can identify depends largely on
the code coverage of the test cases. Also, some ad hoc synchronization
loops may terminate after only one iteration, making it hard to
identify them as ad hoc synchronization loops [18]. For these reasons,
we choose the static method, i.e., analyzing source code.

The biggest challenge in automatically identifying ad hoc
synchronizations is separating them from computation loops. The
diversity of ad hoc synchronizations makes this especially hard. To
address this challenge, we have to identify the common elements among
various ad hoc synchronization implementations.
Commonality among ad hoc synchronizations: Interestingly, ad hoc
synchronizations are all implemented using loops, referred to as sync
loops (Figure 9). While a sync loop can have many exit conditions, at
least one of them is satisfied when an expected synchronization event
happens. We refer to such exit conditions as sync conditions. The sync
condition directly or indirectly depends on a certain shared variable
(referred to as a sync variable) that is loop-invariant locally and
modified by a remote thread.

Note that a sync variable may not necessarily be directly used by a
sync condition (e.g., inside a while loop condition). Instead, a sync
condition may have a data or control dependency on it, as in the
examples shown in Figure 1(c) and Figure 5(a)(c).

[Figure 9 diagram. Panels: (a) spinning; (b) asynchronous checking.
Elements: sync loop, R flag, W flag, computation. Legend: flag:
synchronization variable; W flag: synchronization write.]

Figure 9: Ad hoc synchronization abstract model. The loop
exit condition (i.e., sync condition) either directly or indirectly
depends on a sync variable.

Figure 10: SyncFinder design to automatically identify and annotate ad hoc synchronization

Following the above characteristic, SyncFinder starts from loops in the
target programs and examines their exit conditions to identify those
that (1) are loop-invariant, (2) depend directly or indirectly on a
shared variable, and (3) can be satisfied by a remote thread’s update
to this variable. By checking these constraints, SyncFinder filters out
most computation loops, as shown in our evaluation.

Checking all of the above conditions requires 
SyncFinder to conduct (1) program analysis to know 
the exit conditions for each loop; (2) data and control flow 
analysis to know the dependencies of exit conditions; 
(3) some static thread analysis to conservatively identify 
what segment of code may run concurrently; and (4) some 
simple satisfiability analysis to check whether the remote 
update to the sync variable can satisfy the sync condition. 

As shown in Figure 10, SyncFinder consists of the following steps: (1)
loop detection and exit condition extraction; (2) exit dependent
variable (EDV) identification; (3) pruning of computation and condvar
loops based on characteristics of EDVs; (4) synchronization pairing,
which pairs an identified sync loop with a remote update that would
break the program out of this sync loop; and (5) final result
reporting and annotation in the target program’s source code.

SyncFinder is built on top of the LLVM compiler infrastructure [23]
since it provides several useful basic features that SyncFinder needs.
LLVM’s intermediate representation (IR) is based on static single
assignment (SSA) form, which automatically provides a compact
definition-use graph and control flow graph for every function, both of
which can be leveraged by SyncFinder’s data- and control-flow analysis.
In addition, SyncFinder uses LLVM’s loopinfo analysis, alias analysis,
and constant propagation tracking to implement the ad hoc sync loop
identification algorithm. SyncFinder annotation is done via the static
instrumentation interfaces provided in LLVM. In the rest of this
section, we focus on our algorithms and do not go into details about
the basic analyses provided by LLVM.


3.2 Finding Loops 


Apps         while   for   goto   Total
Apache         27      4      2      33
MySQL           ?      4     26       ?
OpenLDAP        7      4      4       ?
Mozilla-js     12      4      1      17

Table 6: Loop mechanisms used for real-world ad hoc
synchronization. There are a non-negligible number of “goto”
loops, which often complicate loop analysis (e.g., Figure 4).


As shown in Table 6, ad hoc synchronizations are imple- 
mented using three primary forms of loops: “while”, “for” 
and “goto”. Fortunately, LLVM’s loopinfo pass identifies 
all those loops based on back edges in LLVM IR. 

For each loop identified by LLVM, SyncFinder extracts 
its exit conditions. Specifically, it identifies the basic 
blocks with at least one successor outside of the loop, then 
for each identified basic block, SyncFinder extracts its ter- 
minator instruction, from which SyncFinder can identify 
the branch conditions. Such conditions are the exit con- 
ditions for this loop. SyncFinder represents the exit con- 
ditions in a canonical form: disjunction (OR) of multiple 
conditions, and examines each separately. 
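
The exit-block test described above can be sketched on a toy control
flow graph. This is not LLVM code: struct block, its fields, and both
functions are hypothetical simplifications in which each block has at
most two successors and loop membership is precomputed.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_SUCC 2

/* Hypothetical, much-simplified basic block: up to two successors,
 * given as indices into a block array; -1 means "no successor". */
struct block {
    int  succ[MAX_SUCC];
    bool in_loop;          /* true if the block belongs to the loop */
};

/* A block is an exiting block if it is inside the loop and at least
 * one of its successors lies outside the loop. Its terminator then
 * carries one of the loop's exit conditions. */
bool is_exiting_block(const struct block *blocks, size_t i)
{
    if (!blocks[i].in_loop)
        return false;
    for (int s = 0; s < MAX_SUCC; s++) {
        int t = blocks[i].succ[s];
        if (t >= 0 && !blocks[t].in_loop)
            return true;
    }
    return false;
}

/* Collect the indices of all exiting blocks and return how many were
 * found. The caller must provide room for n entries in out. */
size_t find_exiting_blocks(const struct block *blocks, size_t n,
                           size_t *out)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (is_exiting_block(blocks, i))
            out[count++] = i;
    }
    return count;
}
```

For a simple while loop with a preheader, a header that branches either
into the body or out of the loop, and a body that jumps back to the
header, only the header is reported as an exiting block.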

In addition, since LLVM does not keep loop context information (e.g.,
loop headers and bodies) across functions, SyncFinder keeps track of it
in its own data structures and uses it throughout the analysis.


3.3 Identifying Sync Loops 


The key challenge of SyncFinder is to differentiate sync 
loops from computation loops. To address this challenge, 
SyncFinder examines the exit conditions of each loop by 
going through the following steps to filter out computation 
loops. 
Figure 11: Leaf-EDV identification. SyncFinder recursively
tracks Exit Dependent Variables (EDVs) along the data and control
flow until it reaches a leaf-EDV.


(1) Exit Dependent Variable (EDV) analysis: For each
exit condition of each loop in the target program, the first 
step is to identify all variables that this exit condition de- 
pends on—we refer to them as exit dependent variables 
(EDVs). If a loop is a sync loop, the sync variables 
should be included in its EDVs. Note that a sync vari- 
able is not necessarily used in an exit condition (sync con- 
dition) directly. A loop exit condition can be data/control- 
dependent on a sync variable. Therefore, we conduct data- 
flow and control-flow analysis to find indirect EDVs. The 
EDV identification process is similar to static backward 
slicing [48, 38, 15]. 

SyncFinder first starts from variables directly referenced in the exit
condition. They are added into an EDV set. Then, as shown in Figure 11,
it pops a variable out of the EDV set and finds new EDVs along this
variable’s data/control flow. New EDVs are inserted into the set. It
then pops another EDV from the set, and so on, until it reaches the
loop boundary. An EDV that does not depend on any other variable
inside this loop is referred to as a leaf-EDV (similar to a “live-in”
variable). SyncFinder maintains a separate set for leaf-EDVs.
Leaf-EDVs are the ones we should focus on since they are not derived
from any other EDVs in this loop.
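
The worklist process above can be sketched as follows. This is a
minimal illustration, not SyncFinder's implementation: variables are
small integers, and the in-loop dependences are assumed to be already
available as a boolean matrix (struct dep_graph and find_leaf_edvs are
hypothetical names).

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_VARS 16

/* Hypothetical in-loop dependence relation: dep[v][u] is true when
 * variable v is data- or control-dependent on variable u inside the
 * loop. */
struct dep_graph {
    size_t nvars;
    bool   dep[MAX_VARS][MAX_VARS];
};

/* Starting from the variables used directly in an exit condition,
 * follow dependences backward with a worklist. Variables that have no
 * in-loop dependences are recorded as leaf-EDVs. Returns the number of
 * leaf-EDVs written to leaves, which must hold MAX_VARS entries. */
size_t find_leaf_edvs(const struct dep_graph *g,
                      const size_t *direct, size_t ndirect,
                      size_t *leaves)
{
    bool   visited[MAX_VARS] = { false };
    size_t worklist[MAX_VARS];   /* each variable is pushed at most once */
    size_t top = 0, nleaves = 0;

    for (size_t i = 0; i < ndirect; i++) {
        if (!visited[direct[i]]) {
            visited[direct[i]] = true;
            worklist[top++] = direct[i];
        }
    }

    while (top > 0) {
        size_t v = worklist[--top];          /* pop one EDV */
        bool has_dep = false;
        for (size_t u = 0; u < g->nvars; u++) {
            if (!g->dep[v][u])
                continue;
            has_dep = true;
            if (!visited[u]) {               /* newly discovered EDV */
                visited[u] = true;
                worklist[top++] = u;
            }
        }
        if (!has_dep)
            leaves[nleaves++] = v;           /* nothing further to track */
    }
    return nleaves;
}
```

The visited array keeps the traversal from revisiting a variable, so
the sketch also terminates on cyclic dependences.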

During the backward data/control flow tracking process, 
if the dependency analysis encounters a function whose 
return value or passed-by-reference arguments affect the 
loop exit condition, SyncFinder further tracks the depen- 
dency via inter-procedural analysis. SyncFinder applies 
data- and control-flow analysis starting from the function’s 
return value, and identifies Return/arguments-Dependent 
Variables (RDVs) in the callee. Such RDVs are also added 
into the leaf-EDV set. In addition, all RDVs of this func- 
tion are stored in a summary to avoid analyzing this func- 
tion again for other loops. 

To handle variable and function pointer aliasing, SyncFinder leverages
and extends LLVM’s alias analysis to allow it to go beyond function
boundaries.

(2) Pruning computation loops: For every exit condition of a loop,
SyncFinder applies the following two pruning steps to check whether it
is a sync condition. At the end, if a loop has at least one sync
condition, it is identified as a sync loop. Otherwise, it is pruned
out as a computation loop. Most computation loops are filtered in this
phase.


Non-shared variable pruning: A sync variable should 
be a shared variable that can be set by a remote thread. 
Specifically, it should be either a global variable, a heap 
object, or a data object (even stack-based) that is passed to 
a function (e.g., thread starter function) called by another 
thread, which can be shared by the two threads. 

Therefore, if an exit condition has no shared variables in 
its leaf-EDV set, it is deleted from the loop’s exit condition 
set. SyncFinder moves to the next exit condition of this 
loop. If the loop has no exit conditions left, this loop is 
pruned out as a computation loop. 


Loop-variant based pruning: In almost all cases, a
sync condition is loop-invariant locally, and only a remote 
thread changes the result of the sync condition. Based on 
this observation, SyncFinder prunes out those exit condi- 
tions that are loop-variant locally as shown on Figure 12. 
It is possible that some ad hoc synchronizations may also 
change the sync conditions locally. In all our experiments 
with 25 concurrent programs, we did not find any true ad 
hoc synchronizations that SyncFinder missed due to this 
pruner. Note that some exit conditions, such as expiration 
time, are separated as different conditions, and we exam- 
ine each condition separately. 
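
The two pruning steps above can be summarized as simple predicates.
The sketch assumes that an earlier analysis has already decided, for
every leaf-EDV, whether it is shared and whether the local thread
modifies it inside the loop; struct leaf_edv, struct exit_cond, and
the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one leaf-EDV. */
struct leaf_edv {
    bool is_shared;         /* global, heap object, or passed to another thread */
    bool modified_in_loop;  /* written by the local thread inside the loop */
};

/* One exit condition together with its leaf-EDV set. */
struct exit_cond {
    const struct leaf_edv *edvs;
    size_t nedvs;
};

/* A leaf-EDV survives both pruning steps only if it is shared
 * (non-shared variable pruning) and locally loop-invariant
 * (loop-variant based pruning). */
bool is_sync_var_candidate(const struct leaf_edv *v)
{
    return v->is_shared && !v->modified_in_loop;
}

/* An exit condition remains a sync condition candidate if at least
 * one of its leaf-EDVs survives. */
bool is_sync_cond_candidate(const struct exit_cond *c)
{
    for (size_t i = 0; i < c->nedvs; i++) {
        if (is_sync_var_candidate(&c->edvs[i]))
            return true;
    }
    return false;
}

/* A loop is kept as a potential sync loop if any exit condition
 * survives; otherwise it is pruned as a computation loop. */
bool is_sync_loop_candidate(const struct exit_cond *conds, size_t nconds)
{
    for (size_t i = 0; i < nconds; i++) {
        if (is_sync_cond_candidate(&conds[i]))
            return true;
    }
    return false;
}
```

A list-walking loop whose only leaf-EDV is a shared pointer that the
loop itself advances is pruned, while a loop that spins on an
unmodified shared flag is kept.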





(a) Loop-variant module

    while(module){
      next = module->next;
      free(module);
      module = next;
    } /* Mozilla */

(b) Loop-variant condition checking

    for (i = 0; i < nlights; i++){
      VecMatMult(lp->pos, m, lp->pos);
      lp = lp->next;
    } /* SPLASH */

Figure 12: The non-sync variables pruned out by loop-
variant based pruning. In the two computation loops, the vari-
ables in italic font are shared variable leaf-EDVs.


To check whether an exit condition is loop-variant, SyncFinder applies
a modification (MOD) analysis within the scope of the loop being
examined. Specifically, it checks all leaf-EDVs and leaf-RDVs of this
loop and prunes out those modified locally within this loop. The
leaf-RDV summary is also updated accordingly.

(3) Pruning condvar loops: SyncFinder does not consider condvar loops
(i.e., sync loops that are associated with cond_wait primitives) as ad
hoc loops, as they can be easily recognized by intercepting or
instrumenting these primitives. As the final step of ad hoc sync loop
identification, SyncFinder checks every loop candidate to see whether
it calls a cond_wait primitive inside the loop. Loops that use such
primitives are recognized as condvar loops and are pruned out. The
names of cond_wait primitives (original pthread functions or wrappers)
are provided as input to SyncFinder so that it can identify cond_wait
calls.


3.4 Synchronization Pairing 


Once we identify a potential sync loop, we find the remote update
(referred to as a sync write) that would “release” (break) the wait
loop. To identify a sync write, SyncFinder first collects all write
instructions modifying sync variable candidates, and then applies the
following pruning steps.

Pruning unsatisfiable remote updates: For each remote update to the
target sync variable candidate, SyncFinder analyzes what value is
assigned to this variable and whether it can satisfy the sync
condition. One way to do this is to use a SAT solver, but that is too
heavyweight, especially since, according to our observations (shown in
Table 7), the majority (66%) of sync writes either assign constant
values to sync variables or use simple counting operations like
increment/decrement, rather than complicated computations. This is
because a sync variable is usually a control variable (e.g., status,
flag, etc.) and does not require sophisticated computations.

Apps         total   constant        inc/dec op
Apache          42    21 (50.0%)       5 (11.9%)
MySQL          325   125 (38.5%)     110 (33.8%)
OpenLDAP       203    48 (23.6%)      18 (8.9%)
Mozilla-js      83    41 (49.4%)      31 (37.3%)

Table 7: The characteristics of writes to sync variables. In
the four sampled applications, the majority of writes assign constant
values, or use simple increase or decrease operations.

Therefore, instead of using a SAT solver, we use constant propagation
to check whether a remote update would satisfy the exit condition. For
an assignment of a constant, SyncFinder substitutes the constant for
the variable and propagates it to the exit condition to see whether the
condition is satisfiable. For increment-based updates, SyncFinder
treats the update as “sync var > 0”, since it obviously does not
release a loop that is waiting for the exit condition
“(sync var == 0)”.
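
The check described above can be sketched for the simplest case, in
which the exit condition compares the sync variable against an integer
constant. The types and names below (cmp_op, exit_cond, sync_write,
write_may_satisfy) are hypothetical, and the handling of increments
follows the paper's "sync var > 0" rule under the added assumption
that the variable is an integer.

```c
#include <stdbool.h>

/* Hypothetical, simplified model: the loop exits when
 * "sync_var <op> rhs" becomes true. */
enum cmp_op { CMP_EQ, CMP_NE, CMP_LT, CMP_LE, CMP_GT, CMP_GE };

struct exit_cond {
    enum cmp_op op;
    long rhs;
};

enum write_kind { WRITE_CONST, WRITE_INCREMENT };

struct sync_write {
    enum write_kind kind;
    long value;               /* used only for WRITE_CONST */
};

/* Evaluate the exit condition for a concrete value. */
static bool holds(const struct exit_cond *c, long v)
{
    switch (c->op) {
    case CMP_EQ: return v == c->rhs;
    case CMP_NE: return v != c->rhs;
    case CMP_LT: return v <  c->rhs;
    case CMP_LE: return v <= c->rhs;
    case CMP_GT: return v >  c->rhs;
    case CMP_GE: return v >= c->rhs;
    }
    return false;
}

/* Could this remote write make the exit condition true? */
bool write_may_satisfy(const struct exit_cond *c,
                       const struct sync_write *w)
{
    if (w->kind == WRITE_CONST)
        return holds(c, w->value);    /* substitute the constant */

    /* Increment: modeled as "sync_var > 0". The condition is
     * satisfiable if some integer v >= 1 makes it true. */
    switch (c->op) {
    case CMP_EQ: return c->rhs > 0;
    case CMP_NE: return true;
    case CMP_LT: return c->rhs > 1;
    case CMP_LE: return c->rhs > 0;
    case CMP_GT: return true;
    case CMP_GE: return true;
    }
    return false;
}
```

For the loop in Figure 7(a), which exits once state >= 1, the write
state = 3 is kept as a sync write candidate, whereas an increment is
pruned for a loop that waits for the variable to become 0.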


Pruning serial pairs: A sync loop and a sync write should be able to
execute concurrently. If there is a happens-before relation between
such a pair due to thread creation/join, a barrier, etc., the remote
write does not match the sync loop. Due to the limitations of static
analysis, SyncFinder currently prunes only serial pairs related to
thread creation/join. Specifically, SyncFinder follows thread creation
and conservatively estimates which code might be running concurrently.
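
A deliberately simplified sketch of the creation/join rule follows. It
assumes straight-line parent code in which statement positions are
totally ordered, and it considers only a sync loop in a child thread
paired with a write in the parent. struct parent_timeline and both
functions are hypothetical; a real static analysis has to handle
branches, loops, and multiple threads conservatively.

```c
#include <stdbool.h>

/* Hypothetical positions in the parent thread's program order. */
struct parent_timeline {
    int create_pos;   /* where the parent creates the child thread */
    int join_pos;     /* where the parent joins the child thread   */
};

/* A parent-side write can overlap with code in the child only if it
 * sits strictly between the create and the join. */
bool may_run_concurrently(const struct parent_timeline *t, int write_pos)
{
    return write_pos > t->create_pos && write_pos < t->join_pos;
}

/* A (child sync loop, parent write) pair is serial, and therefore
 * pruned, when the write cannot overlap the child's lifetime. */
bool is_serial_pair(const struct parent_timeline *t, int write_pos)
{
    return !may_run_concurrently(t, write_pos);
}
```

With a child created at position 10 and joined at position 50, a write
at position 5 or 60 forms a serial pair, while a write at position 30
does not.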


3.5 SyncFinder Annotation 


After the above pruning process, the remaining loops are identified as
sync loops, along with their corresponding sync writes. All the
results are stored in a file. SyncFinder also automatically annotates
the target software’s source code using the LLVM static
instrumentation framework. It inserts
//#SyncAnnotation: Sync_Loop_Begin(&loopId) and
//#SyncAnnotation: Sync_Loop_End(&loopId) at the beginning and end,
respectively, of an identified sync loop. In addition, inside the
loop, it annotates the read of a sync variable by inserting
//#SyncAnnotation: Sync_Read(&syncVar, &loopId). For the corresponding
sync write, it inserts
//#SyncAnnotation: Sync_Write(&syncVar, &loopId). The loopId is used
to match a remote sync write with a sync loop. Similar annotations are
also inserted into the target program’s bytecode to be leveraged by
concurrency bug detection tools, as discussed in the next section.


4 Two Use Cases of SyncFinder 


SyncFinder’s auto-identification can be used by many bug detection
tools, performance profiling tools, concurrency testing frameworks,
programming language designers, etc. We built two use cases to
demonstrate its benefits.


4.1 A Tool to detect bad practices 


It is considered bad practice to wait inside a critical section, as it
can easily introduce deadlocks like the Apache example shown in
Figure 2 and the MySQL example in Figure 3. Furthermore, it can result
in performance issues caused by cascading wait effects, and may
introduce deadlocks in the future if programmers are not careful. As a
demonstration, we built a simple detector (referred to as the
wait-inside-critical-section detector) that catches these cases by
leveraging SyncFinder’s auto-annotation of ad hoc synchronizations.
Our detection algorithm can also be easily integrated into any
existing deadlock detection tool.

To detect such patterns, our simple detector checks every sync loop
annotated by SyncFinder to see whether it is performed while holding
any locks. If a sync loop holds a lock, SyncFinder checks the remote
sync write to see whether the write is performed after acquiring the
same lock or after another ad hoc sync loop, and so on, to see whether
a cycle can form. If so, the detector reports it as a potential issue:
either a deadlock or at least a bad practice.
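
One way to picture the cycle check is as a search on a wait-for graph.
The sketch below is not the detector's actual implementation: it
assumes the waits-for edges have already been extracted, and
struct wait_graph and has_wait_cycle are hypothetical names. It uses a
standard depth-first search with three node colors.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 16

/* Hypothetical wait-for graph. Each node is a lock, a sync loop, or a
 * sync write; edge[a][b] means "a cannot make progress until b does". */
struct wait_graph {
    size_t n;
    bool   edge[MAX_NODES][MAX_NODES];
};

enum color { WHITE, GRAY, BLACK };

/* Depth-first search; reaching a GRAY node means we followed a back
 * edge, so the graph contains a cycle. */
static bool dfs(const struct wait_graph *g, size_t v, enum color *color)
{
    color[v] = GRAY;
    for (size_t u = 0; u < g->n; u++) {
        if (!g->edge[v][u])
            continue;
        if (color[u] == GRAY)
            return true;
        if (color[u] == WHITE && dfs(g, u, color))
            return true;
    }
    color[v] = BLACK;
    return false;
}

/* Returns true if any chain of waits leads back to its start. */
bool has_wait_cycle(const struct wait_graph *g)
{
    enum color color[MAX_NODES];
    for (size_t i = 0; i < MAX_NODES; i++)
        color[i] = WHITE;
    for (size_t v = 0; v < g->n; v++) {
        if (color[v] == WHITE && dfs(g, v, color))
            return true;
    }
    return false;
}
```

As an example, let node 0 be a sync loop executed while holding a
lock, node 1 its remote sync write, and node 2 the lock. The loop waits
for the write (0 to 1), the write needs the lock (1 to 2), and the lock
is released only after the loop exits (2 to 0), which closes the cycle.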


4.2 Extensions to data race detection 


We also extend Valgrind [33]’s dynamic data race detec- 
tor to leverage SyncFinder’s auto-identification of ad hoc 
sync loops. Valgrind implements a happens-before algo- 
rithm [21] using logical timestamps, which was originally 
based on conventional primitives including mostly lock 
primitives, and thread creation/join. It cannot recognize ad 
hoc synchronizations. As a result, it can introduce many 
false positives (shown in Table 12) as discussed in Sec- 
tion 2 and illustrated using two examples in Figure 7. 

We extend Valgrind to eliminate data race false posi- 
tives by considering ad hoc synchronizations annotated by 
SyncFinder. It treats the end of a sync loop in a similar 
way to a cond_wait operation, and the corresponding sync 
write like a signal operation. This way it keeps track of 
the happens-before relationship between them. We also 
extend Valgrind to not consider sync variable reads and 
writes as data races. 
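
The idea of treating a sync write like a signal and a sync loop exit
like a wait can be illustrated with a small vector clock sketch. This
is not Valgrind code and does not reflect its internal data
structures; struct vclock and the vc_* functions are hypothetical, and
the sketch fixes the number of threads at two.

```c
#include <stdbool.h>
#include <stddef.h>

#define NTHREADS 2

/* A vector clock: one logical timestamp per thread. */
struct vclock {
    unsigned t[NTHREADS];
};

/* a happens-before b if a <= b component-wise and a != b. */
bool vc_happens_before(const struct vclock *a, const struct vclock *b)
{
    bool strictly_less = false;
    for (size_t i = 0; i < NTHREADS; i++) {
        if (a->t[i] > b->t[i])
            return false;
        if (a->t[i] < b->t[i])
            strictly_less = true;
    }
    return strictly_less;
}

/* Local event on thread tid: advance its own component. */
void vc_tick(struct vclock *c, size_t tid)
{
    c->t[tid]++;
}

/* Sync write, treated like a signal: publish a copy of the writer's
 * clock, then advance the writer. */
void vc_sync_write(struct vclock *writer, size_t tid,
                   struct vclock *published)
{
    *published = *writer;
    vc_tick(writer, tid);
}

/* Sync loop exit, treated like returning from a wait: merge the
 * published clock into the waiter's clock, then advance the waiter. */
void vc_sync_loop_exit(struct vclock *waiter, size_t tid,
                       const struct vclock *published)
{
    for (size_t i = 0; i < NTHREADS; i++) {
        if (published->t[i] > waiter->t[i])
            waiter->t[i] = published->t[i];
    }
    vc_tick(waiter, tid);
}
```

Applied to Figure 7(b), the worker's access at S1 precedes its sync
write at S2, and the listener's access at S4 follows its sync loop
exit at S3, so S1 is ordered before S4 and no race is reported.
Without the merge at the loop exit, the two accesses would appear
concurrent.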
                                Total    Identified Sync Loops    Missed
             Apps.              loops    Total    True    FP      ones
Server       Apache             1462      17       15      2       1
             MySQL              4265      48       42      6       3
             OpenLDAP           2044      18       14      4       1
             Cherokee            748       6        6      0       0
             AOLServer           496       6        6      0       -
             Nginx               705      12       11      1       -
             BerkeleyDB         1006      15       11      4       -
             BIND9              1372       5        4      1       -
Desktop      Mozilla-js          848      16       11      5       1
             PBZip2               45       7        7      0       0
             Transmission       1114      14       12      2       1
             HandBrake           551      13       13      0       -
             p7zip              1594      10        9      1       -
             wxDFast             154       6        6      0       -
Scientific   Radiosity            80      12       12      0       0
             Barnes               88       7        7      0       0
             Water                84       9        9      0       0
             Ocean               339      20       20      0       0
             FFT                  57       7        7      0       0
             Cholesky            362       8        8      0       -
             RayTracer           144       3        3      0       -
             FMM                 108       8        8      0       -
             Volrend              77       9        9      0       -
             LU                   38       0        0      0       -
             Radix                52      14       14      0       -
             Total                       290      264     26
             (Ave.)                    (11.6)   (10.6)  (1.0)      -

Table 8: Overall results of SyncFinder: Every concurrent
program uses ad hoc sync loops except LU. Both true ad hoc
sync loops and false positives are shown here. For the 12
programs used in the characteristic study, the numbers of missed ad
hoc sync loops are also reported. They are obtained by comparing
with our manual checking results from the characteristic
study. We cannot show the numbers of missed ad hoc sync loops
for the programs not covered by the study, since we did not
manually examine them as we did for the 12 studied programs. To
show SyncFinder’s total exploration space, we also show the
total number of loops, most of which are computation loops. Note
that the total numbers of ad hoc sync loops differ from
those shown in Table 2 because some code (for other
platforms such as FreeBSD, etc.) is not included during
compilation.


5 Evaluation 


5.1 Effectiveness and Accuracy 


We evaluated SyncFinder on 25 concurrent programs, including the 12
used in our manual characteristic study and 13 others. Table 8 shows
the overall results of SyncFinder on the 25 programs. On average,
SyncFinder accurately identifies 96% of ad hoc sync loops in the 12
studied programs and has a 6% false positive rate overall. SyncFinder
successfully identified diverse ad hoc order synchronizations,
including some that we missed during our manual identification. For
example, it successfully identifies complicated, interlocked “goto”
sync loops such as those shown in Figure 4.

For the 12 studied programs, SyncFinder misses a few (1-3 per
application) sync loops in large server/desktop applications.
Considering the total number of loops (up to 4265) in each of these
applications, such a small miss rate does not limit SyncFinder’s
applicability to real-world programs. SyncFinder fails to identify
these sync loops because source code is unavailable for the library
functions involved and because pointer alias analysis is inaccurate.

SyncFinder also returns a low number of false positives for all 25
programs. As shown in Table 8, SyncFinder has 0-6 false positives per
program (i.e., a false positive rate of 0-30%). Such numbers are quite
reasonable. Programmers can easily examine the reported sync loops to
prune out those few false positives. Most of the false positives are
caused by inaccurate function pointer analysis. Due to complicated
function pointer aliasing, SyncFinder sometimes cannot track into
callee functions to check whether a target variable (leaf-EDV) is
locally modified. In these cases, SyncFinder conservatively considers
the target variable a sync variable.


5.2 Sync Loop Identification and Pruning 


            Total    Exit     Leaf-     After non-    After loop-   After cond-
Apps.       loops    cond.    EDVs      shared pr.    var. pr.      var pr.
Apache      1,462    3,120    ?         ?             ?             ?
MySQL       4,265    ?        ?         ?             ?             ?
OpenLDAP    2,044    4,434    11,276    ?             43            27
PBZip2      ?        278      79        10            16            9

Table 9: EDV Analysis and non-sync variable pruning. After
identifying leaf-EDVs for each loop, SyncFinder applies non-
shared, loop-variant and condvar-loop based pruning schemes.
The final results are the sync variables of the ad hoc sync loops.
Some sync variables may be associated with the same sync loop.


To show the effectiveness of sync loop identification, we test
SyncFinder on some server/desktop applications and show in Table 9 the
results from each of the sync loop identification steps. From the
total loops identified, SyncFinder extracts exit conditions and
identifies all leaf-EDVs (the third column in Table 9). From the
leaf-EDVs, SyncFinder prunes out non-shared variables (95% of
leaf-EDVs) and applies loop-variant based pruning, which further
prunes 80% of shared leaf-EDVs. SyncFinder then applies the final
pruning step to prune out sync variables that are associated with
condvar loops. The remaining variables are sync variable candidates,
and the loops using them are potential sync loops.


5.3 Synchronization Pairing and Pruning


During synchronization pairing, SyncFinder applies two pruning
schemes: unsatisfiable remote update pruning and serial pair pruning.
Table 10 shows the effect of those pruning steps on the same set of
server/desktop applications as in Table 9. First, remote update based
pruning eliminates 51.8% of false sync pair candidates on average. It
is especially effective on Apache, since the majority of sync writes
are just simple assignments with constant values, so it is easy to
determine whether such values would satisfy the corresponding sync
exit conditions.

            Initial    w/ Remote     w/ Serial    With    True
Apps.       pairs      update pr.    pair pr.     both    pairs
Apache      ?          ?             ?            ?       ?
MySQL       331        208           ?            ?       ?
OpenLDAP    168        ?             146          113     96
PBZip2      ?          ?             ?            9       9

Table 10: False synchronization pair pruning. Note that the
numbers shown here are synchronization pairs. In all the other
results, we show “synchronization loops” (regardless of how many
setting statements exist for an ad hoc sync loop).

9th USENIX Symposium on Operating Systems Design and Implementation (OSDI 10) 173

Second, the effectiveness of serial pair pruning depends
on application characteristics. While it prunes out almost
all false positives in simple desktop/scientific programs
(e.g., PBZip2), it is less effective in servers like Apache,
where many function pointers are used. Due to the limitations
of function pointer analysis, it is hard to know in all
cases whether two given regions can never be concurrent.
To be conservative, SyncFinder does not prune the pairs
inside such regions. Fortunately, remote update based
pruning helps filter them out.
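
To make the remote update check concrete, the sketch below models the simple case the text highlights: a remote write of a constant, tested against the loop's exit condition. It is an illustration under stated assumptions, not SyncFinder's analysis; the SyncPair type, the use of a Python predicate for the exit condition, and the rule of keeping non-constant writes are all hypothetical.

```python
from typing import Callable, NamedTuple, Optional


class SyncPair(NamedTuple):
    """Hypothetical candidate pair: a sync loop and one remote write."""
    loop: str                         # label for the waiting loop
    exit_cond: Callable[[int], bool]  # True if this value lets the loop exit
    written_const: Optional[int]      # constant written remotely, or None


def remote_update_prune(pairs):
    """Drop pairs whose written constant can never satisfy the exit condition.

    Writes that are not constants (None) are kept, mirroring the
    conservative stance described in the text.
    """
    kept = []
    for p in pairs:
        if p.written_const is None or p.exit_cond(p.written_const):
            kept.append(p)
    return kept


pairs = [
    SyncPair("while (!done)", lambda v: v != 0, 1),      # done = 1: can exit
    SyncPair("while (!done)", lambda v: v != 0, 0),      # done = 0: cannot
    SyncPair("while (state != 2)", lambda v: v == 2, None),  # unknown write
]
kept = remote_update_prune(pairs)
```

Only the second pair is pruned: writing 0 to done can never release a loop that waits for a nonzero value.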


5.4 Two Use Cases: Bug Detection 


Apps | Deadlock (New)
Apache | 10) | 1
MySQL | 2)
Mozilla | 2)


Table 11: Deadlock and bad practice detection 


Table 11 shows that our simple deadlock detector (leverag- 
ing SyncFinder’s ad hoc synchronization annotation) de- 
tects five deadlocks involving ad hoc order synchroniza- 
tions, including those shown in Figure 2 and Figure 3. 
Previous tools would fail to detect these bugs since they 
cannot recognize ad hoc synchronizations. Besides dead- 
locks, our detector also reports 16 bad practices, i.e. wait- 
ing in a sync loop while holding a lock, which could raise 
performance issues or cause future deadlocks. 
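
The reason ad hoc synchronization annotations matter for deadlock detection can be shown with a wait-for graph. The sketch below is a generic cycle check, not the detector used in this paper; the thread names and edge lists are hypothetical. One thread holds a lock while spinning on a flag, and the thread that would set the flag first needs that lock.

```python
def build_wait_for(edges):
    """Build a wait-for graph from (waiter, waited_on) thread pairs."""
    graph = {}
    for waiter, holder in edges:
        graph.setdefault(waiter, set()).add(holder)
        graph.setdefault(holder, set())
    return graph


def has_cycle(graph):
    """Depth-first search; a back edge to a node on the stack is a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)


# T2 waits for a lock held by T1 (visible to a conventional detector).
lock_edges = [("T2", "T1")]
# T1 spins in an ad hoc sync loop on a flag that only T2 sets
# (visible only once the loop is annotated).
adhoc_edges = [("T1", "T2")]

without_annotation = has_cycle(build_wait_for(lock_edges))
with_annotation = has_cycle(build_wait_for(lock_edges + adhoc_edges))
```

The cycle appears only when the ad hoc edge is included, which matches the observation that tools unaware of ad hoc synchronization miss such deadlocks.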


Apache | 30 | 7 |
MySQL | 25 | |
OpenLDAP | 7 | 4 | 43%
Water | 79 | 11 | 86%





Table 12: False positive reduction in Valgrind 


Table 12 shows that SyncFinder’s auto-annotation reduces
the false positive rate of the Valgrind data race detector
by 43-86%.

6 Related Work


Spin and hang detection Some recent work detects
simple spinning-based synchronizations [32, 25, 18]. For
example, [25] proposed new hardware buffers to detect
spinning loops on the fly, and [18] provides a similar
capability in software. Both can detect only simple
spinning loops, i.e., sync loops that have a single exit
condition and depend directly on sync variables (referred
to as “sc-dir” in Table 3 in Section 2). As shown in
Table 3, such simple spinning loops account for less than
16% of ad hoc sync loops on average in the server/desktop
applications we studied.
Besides, both of them are dynamic approaches and 
thereby suffer from the coverage limitation of all dy- 
namic approaches (discussed in Section 3). In contrast, 
SyncFinder uses a static approach and can detect various 
types of ad hoc synchronizations. Additionally, we also 
conduct an ad hoc synchronization characteristic study. 
Synchronization annotation Many annotation languages
[4, 2, 1, 41] have been proposed for synchronizations
in concurrent programs. Unfortunately, programmers rarely
use annotations because writing them is tedious.
SyncFinder complements this work by providing
automatic annotation for ad hoc synchronizations.
Concurrent bug detection tools Much research has been 
conducted on concurrency bug detection [47, 20, 31, 6, 
17, 11, 43]. These tools usually assume that they can 
recognize all synchronizations in target programs. As we 
demonstrated using deadlock detection and race detection, 
SyncFinder can help these tools improve their effective- 
ness and accuracy by automatically annotating ad hoc syn- 
chronizations that are hard for them to recognize. 
Transactional memory Various transactional memory 
designs have been proposed to solve the programmability 
issues related to mutexes [39, 30, 19, 44] and also con- 
dition variables [10]. Our study complements such work 
by providing ad hoc synchronization characteristics in real 
world applications. 
Software bug characteristics studies Several studies 
have been conducted on the characteristics of software 
bugs [8, 42, 34], including one of our own [26] on con- 
currency bug characteristics. This paper is different from 
those studies by focusing on ad hoc synchronizations in- 
stead of bugs, even though many of them are prone to in- 
troducing bugs. The purpose of this paper is to raise the 
awareness of ad hoc synchronizations, and to warn pro- 
grammers to avoid them when possible. Also we devel- 
oped an effective way to automatically identify those ad 
hoc synchronizations in large software. 


7 Conclusions and Limitations 


In this paper, we provided a quantitative characteristics 
study of ad hoc synchronization in concurrent programs 
and built a tool called SyncFinder to automatically identify 
and annotate them. By examining 229 ad hoc synchronization
loops from 12 concurrent programs, we have found
several interesting and alarming characteristics. Among 
them, the most important results include: all concurrent 
programs have used ad hoc synchronizations and their im- 
plementations are very diverse and hard to recognize man- 
ually. Moreover, a large percentage (22-67%) of ad hoc 
loops in these applications have introduced bugs or perfor- 
mance issues. They also greatly impact the accuracy and 
effectiveness of bug detection and performance profiling 
tools. In an effort to detect these ad hoc synchronizations, 
we developed SyncFinder, a tool that successfully identi- 
fies 96% of ad hoc synchronization loops with a 6% false 
positive rate. SyncFinder helps detect deadlocks missed
by conventional deadlock detection and also reduces data
race detectors’ false positives. Many other tools and
research projects can also benefit from SyncFinder. For
example, concurrency testing tools (e.g., CHESS [31]) can
leverage SyncFinder’s auto-annotation to force a context 
switch inside an ad hoc sync loop to expose concurrency 
bugs. Similarly, performance tools can be extended to pro- 
file ad hoc synchronization behavior. 

All work has limitations, and ours is no exception: (i) 
SyncFinder requires source code. However, this may not 
significantly limit SyncFinder’s applicability since it is 
more likely to be used by programmers instead of end 
users. (ii) Due to some implementation issues, SyncFinder 
still misses 1-3 ad hoc synchronizations. Eliminating
them would require further enhancement to some of our
analyses (such as alias analysis). (iii) Even though
SyncFinder’s false positive rates are quite low, for some 
use cases that are sensitive to false positives, program- 
mers would need to manually examine the identified ad 
hoc synchronization or leverage some execution synthesis 
tools like ESD [49] to help identify false positives. (iv) For 
our characteristic study, we can always study a few more 
applications, especially of different types. 
Abstract


Current multiprocessor systems execute parallel and concurrent 
software nondeterministically: even when given precisely the 
same input, two executions of the same program may produce 
different output. This severely complicates debugging, testing, 
and automatic replication for fault-tolerance. Previous efforts to 
address this issue have focused primarily on record and replay, 
but making execution actually deterministic would address the 
problem at the root. 

Our goals in this work are twofold: (1) to provide fully de- 
terministic execution of arbitrary, unmodified, multithreaded 
programs as an OS service; and (2) to make all sources of in- 
tentional nondeterminism, such as network I/O, be explicit and 
controllable. To this end we propose a new OS abstraction, the 
Deterministic Process Group (DPG). All communication be- 
tween threads and processes internal to a DPG happens de- 
terministically, including implicit communication via shared- 
memory accesses, as well as communication via OS channels 
such as pipes, signals, and the filesystem. To deal with funda- 
mentally nondeterministic external events, our abstraction in- 
cludes the shim layer, a programmable interface that interposes 
on all interaction between a DPG and the external world, mak- 
ing determinism useful even for reactive applications. 

We implemented the DPG abstraction as an extension to 
Linux and demonstrate its benefits with three use cases: plain 
deterministic execution; replicated execution; and record and 
replay by logging just external input. We evaluated our imple- 
mentation on both parallel and reactive workloads, including 
Apache, Chromium, and PARSEC. 


1. Introduction 


Nondeterminism makes the development of parallel and 
concurrent software substantially more difficult. Soft- 
ware testers face daunting incompleteness challenges be- 
cause nondeterminism leads to an exponential explo- 
sion in possible executions [27]. Developers must rea- 
son about large sets of possible behaviors and attempt to 
debug without precise repeatability [31, 36]. Moreover, 
standard techniques for fault-tolerant replication do not 
work when the software being replicated executes nonde- 
terministically [38]. At the same time, the growing pop- 
ularity of multicore architectures is making parallel and 
concurrent software more and more important. 
Unfortunately, nondeterminism is pervasive; thread 
scheduling, memory reordering, and timing variations 
at the hardware level can all affect the interleaving of 
threads and cause a multithreaded program to produce 
different outputs when given the same input. We define 
this as internal nondeterminism. Internal nondetermin- 


ism is entirely hidden from the programmer and thus is 
undesirable. However, as we demonstrate in this paper, it 
is not fundamental and can be completely removed. On 
the other hand, events such as user input and the arrival 
of network packets are triggered nondeterministically by 
the external world. We define this as external nondeter- 
minism; this kind of nondeterminism, if present, is fun- 
damental and cannot be removed. 
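
A small worked example may help make internal nondeterminism concrete. The sketch below is an illustration, not part of dOS: it enumerates every interleaving of two hypothetical threads that each perform an unsynchronized load and store to increment a shared counter, and collects the distinct final values.

```python
from itertools import combinations


def interleavings(a, b):
    """Yield every merge of a and b that preserves each sequence's order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        chosen = set(positions)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in chosen else next(ib) for i in range(n)]


def run(schedule):
    """Execute one interleaving of load/store steps on a shared counter."""
    shared = 0
    regs = {}  # per-thread private register
    for tid, op in schedule:
        if op == "load":
            regs[tid] = shared
        else:  # "store": write back the loaded value plus one
            shared = regs[tid] + 1
    return shared


t1 = [("T1", "load"), ("T1", "store")]
t2 = [("T2", "load"), ("T2", "store")]
schedules = list(interleavings(t1, t2))
outcomes = {run(s) for s in schedules}
```

The input is identical in every run, yet the result depends on which of the six schedules occurs: the counter ends at 2 when the increments do not overlap and at 1 when they do. Fixing the schedule deterministically removes this variation without changing the program.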

What we want is a software environment where internal
nondeterminism is completely eliminated. This is more
than just deterministic record and replay: multithreaded
programs should always execute deterministically relative
to their explicitly specified inputs. Moreover, where
external nondeterminism exists, it should be made explicit
and controllable.

Recent research has begun to explore ways of reduc- 
ing internal nondeterminism in multithreaded programs. 
However, current proposals fall short in several aspects: 
they do not deal with nondeterministic channels other 
than shared-memory; they do not offer ways of making 
external nondeterminism explicit and controllable; they 
either require new hardware [14], apply to only a sub- 
set of programs [7, 29], or require recompilation [6]; and 
they do not support multiprocess applications. 

Our goals are to completely eliminate nondetermin- 
ism where possible, including channels beyond shared- 
memory like pipes, signals, and the filesystem, and to 
make all intentional, external nondeterminism explicit 
and controllable. To this end, we propose a new OS 
abstraction, the Deterministic Process Group (DPG). A 
programmer uses this abstraction to define a determinis- 
tic box inside which all communication happens deter- 
ministically. All of the nondeterministic input received 
by a DPG is interposed upon by the shim layer, an in- 
terface that can be used by programmers to observe and 
control external nondeterminism in a flexible way. 

A DPG is effectively a high-level deterministic vir- 
tual machine. The deterministic guarantees are provided 
transparently by the OS without intervention from the 
programmer; thus, DPGs can host arbitrary, unmodified 
application binaries. At any given time there may be 
many DPGs running alongside many conventional non- 
deterministic processes. An alternative design is full- 
system determinism, in which a hypervisor executes an 
entire OS deterministically relative to inputs triggered by 
the hardware. The DPG approach is more flexible be- 
cause the programmer can select the desired granularity 
of determinism for each individual application. 
1.1 DPG Use Cases


Debugging and Testing Many applications do not con- 
tinuously interact with the external world, but instead 
read inputs at deterministic points in their execution. 
Since DPGs provide internal determinism by default, 
these applications will execute completely deterministi- 
cally when run within a DPG. This has obvious bene- 
fits for debugging, since execution is directly repeatable. 
Moreover, removing internal nondeterminism has the po- 
tential to reduce the problem of testing multithreaded 
programs to the problem of testing sequential programs 
by making execution a function of only the explicit in- 
puts, including external nondeterminism. 


Record/Replay Controlling external nondeterminism 
with the shim layer makes determinism useful even for 
applications that interact continuously with the external 
world. As an example, one can run an application in- 
side a DPG and extend the shim layer to log all exter- 
nal nondeterminism. This log can be used later to faith- 
fully replay an application’s execution for debugging and 
other analyses. Most prior work on record and replay 
of multithreaded applications focuses on how to record 
internal nondeterminism caused by shared-memory ac- 
cesses. This leads to either unwieldy logs and high over- 
heads [16, 22] or imprecise replay [1, 31, 36]. The inter- 
nal determinism offered by DPGs completely subsumes 
this problem; only external inputs need to be recorded. 


Replication for Fault Tolerance DPGs naturally en- 
able replication of multithreaded applications. By run- 
ning multiple copies of an application inside DPGs on 
several machines and replicating the inputs, all replicas 
will behave the same way because there is no internal 
nondeterminism. This can be implemented by extending 
the shim layer to ensure that all replicas receive the same 
input at the same point in their execution. Because DPGs
eliminate all forms of internal nondeterminism, there is
less to log and replicate. Logging and replicating internal
nondeterminism is a major issue in prior work on
replication [5, 38-40], mostly because shared-memory is a
very large source of such nondeterminism.


1.2 Outline and Contributions


This paper makes several conceptual and architectural 
contributions. First, we identify the fundamental dis- 
tinction between internal and external nondeterminism, 
and we demonstrate that internal nondeterminism can be 
eliminated from programs. To do this, we expand on ear- 
lier work that removed shared-memory nondeterminism 
by also removing internal nondeterminism from signals, 
pipes, the filesystem, and other OS channels. 

Second, we propose the Deterministic Process Group 
abstraction (Section 2), which lets programmers define 
the boundary between internal and external nondeter- 
minism. As part of this abstraction we introduce the 
shim layer, whose interface lets programmers observe
and control all external nondeterminism. 

This paper also presents and evaluates our implemen- 
tation of these ideas. In Sections 3-4, we describe dOS, a 
Linux-based implementation of DPGs and the shim layer 
that enables the deterministic execution of arbitrary, un- 
modified binaries. Section 5 demonstrates the usefulness 
of the shim layer by using it to implement determinis- 
tic filesystem services, replicated execution of a multi- 
threaded server, and record/replay. Section 6 provides a 
detailed evaluation of dOS and our shim applications on 
a variety of workloads. Finally, we end with related work 
and closing remarks. 


2. The Abstraction 


Figure 1 illustrates the abstract model of a Deterministic
Process Group and Figure 2 illustrates the major
components of our system. A DPG consists of a group 
of threads and processes along with the kernel objects 
they share. Kernel objects include shared-memory pages, 
pipes, and sockets. Threads communicate by performing 
operations on shared kernel objects, for example by read- 
ing from a shared page or writing to a shared pipe. A ker- 
nel object is internal if it can be modified only by threads 
inside the DPG, and is external if it can be modified by 
threads or devices outside the DPG. We refer to a thread 
executing inside a DPG as a deterministic thread, and we 
refer to a DPG’s set of threads and internal objects col- 
lectively as a deterministic box. 

Figure 1 shows three deterministic threads (Thread1,
Thread2, and Thread3), two internal objects (the memory
page and the pipe), and two external objects (the socket
and the file). Thread1 and Thread2 are members of
the same process, P1. The deterministic box is illustrated
with a dotted outline. Note that internal objects need not 
be shared by the entire DPG; in this example, the memory 
page is shared by just two threads. 

The final component of a DPG is a user-space ser- 
vice called a shim program. A shim program sits on the 
boundary of a deterministic box, and its job is to in- 
terpose on communication that crosses the deterministic 
boundary. Shim programs are written using a system call 
interface called the shim layer. This interface provides 
new opportunities for systems programmers that we ex- 
plore in detail throughout this paper. 


2.1 DPGs and Their Guarantees 


A new DPG is created with the sys_makedet system 
call, and initially hosts just the calling thread. Each new 
thread spawned by the initial thread is added to the DPG, 
and in this way the DPG expands to include all descendant
threads and processes. A thread leaves a DPG when
it exits. We have not found DPG join and leave primitives
necessary and so have not defined them. Threads
hosted in a DPG need not share an address space, which
means a DPG can host many multithreaded processes.


Figure 1. A Deterministic Process Group

Deterministic threads invoke system calls and read 
and write shared-memory just like ordinary threads. 
However, DPGs distinguish between operations on in- 
ternal objects, which happen deterministically, and oper- 
ations on external objects, which happen nondetermin- 
istically. Interactions with external objects represent a 
DPG’s only source of nondeterminism; essentially, these 
external interactions represent the inputs a DPG receives 
from the external world. 

Given the same initial state and the same stream of 
external inputs, a DPG is guaranteed to execute the 
same steps of inter-thread communication and produce 
the same output. More precisely, as a DPG executes it 
performs shared-memory loads and stores, invokes 
system calls, and handles asynchronous signals; each of 
these operations introduces nondeterminism only when 
it involves an entity outside the DPG. This is a stronger 
guarantee than output determinism [1, 23], which guar- 
antees that replaying a program will produce the same 
output, but not that it will reproduce all inter-thread com- 
munication steps that lead to that output. 

For example, when operating on a network socket, the 
read system call returns nondeterministic data. Addi- 
tionally, read is a blocking call; it does not return un- 
til data is available, which means read will block for a 
nondeterministic amount of time. However, when read 
operates on a device that is internal to a DPG, such as an 
internal pipe, read behaves deterministically. 

In summary, a DPG experiences nondeterminism only 
when it: (1) reads data from an external source; (2) 
blocks to wait for external data; or (3) handles a sig- 
nal sent from an external source. Our guarantee is that 
DPGs execute deterministically relative to a stream of 
such nondeterministic input, and also relative to the ini- 
tial state of the DPG at the call to sys_makedet. Note 
that this guarantee holds even across different machines. 


Logical time Conceptually, a DPG executes as if it were
serialized onto a logical timeline, where logical time is
represented by a single global counter. Blocking system
calls occupy two points on the logical timeline, one to
initiate the call and the other to complete the call. dOS
ensures that internal communication is mapped onto the
logical timeline in a deterministic way. (Section 3 describes
how our implementation groups instructions into
atomic epochs in order to extract parallelism.) Note that
logical time and physical time are distinct: DPGs guarantee
deterministic output, but not deterministic performance.
Input from the external world is mapped onto the
logical timeline in a way controlled by the shim layer,
which is the subject of the next section.


Figure 2. System Overview


void shim_attach(tid, SYS|SIG)
void shim_trace(*event)
void shim_resume(tid, result)
void shim_queue_sig(tid, siginfo)
void shim_ctl(tid, ...)

(a) Interposing on Nondeterminism

void shim_sleep(tid)
void shim_add_barrier(tid, logical_time)
int shim_gettime(tid)

(b) Controlling Logical Time


Figure 3. Shim layer system calls


2.2 The Shim Layer 


Every DPG is monitored by a user-space service called a 
shim program, also referred to as a shim. Shim programs 
use the shim layer interface (Figure 3) to observe and 
control nondeterministic input. 

At a high level, there are two kinds of nondetermin- 
istic input: the what and the when. The what includes 
the values of external input, such as data read from the 
network. The when includes the blocking times of
nondeterministic system calls, as well as the delivery times
of external signals. Shims can observe and control both 
kinds of nondeterministic input. 

As a motivating example, consider record and replay 
implemented with a pair of shim programs. The record 
shim observes execution: for every nondeterministic sys- 
tem call, the shim logs the number of logical time steps 
the call spent blocked, along with the return value of the 
call. The replay shim controls execution: it ensures that 
every nondeterministic system call is scheduled to return 
at the specific logical time and with the specific value 
specified in the log. 
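
The record and replay pair can be illustrated with a toy model. The sketch below does not use the shim layer system calls; it is a plain simulation under stated assumptions, in which deterministic_app stands in for a DPG, external_world stands in for a nondeterministic system call, and the RecordShim and ReplayShim classes are hypothetical names. Each log entry holds the two items named in the text: the number of logical time steps spent blocked and the return value.

```python
import random


def deterministic_app(read_input):
    """Stand-in for a DPG: its output depends only on what read_input returns."""
    total = 0
    for _ in range(3):
        blocked_steps, value = read_input()
        total = total * 31 + value + blocked_steps
    return total


def external_world():
    """Stand-in for a nondeterministic call: (blocked steps, return value)."""
    return (random.randint(1, 5), random.randint(0, 255))


class RecordShim:
    """Observes execution: forwards each call and logs when and what."""

    def __init__(self, source):
        self.source = source
        self.log = []

    def read_input(self):
        entry = self.source()
        self.log.append(entry)
        return entry


class ReplayShim:
    """Controls execution: ignores the outside world and replays the log."""

    def __init__(self, log):
        self._entries = iter(log)

    def read_input(self):
        return next(self._entries)


recorder = RecordShim(external_world)
recorded_output = deterministic_app(recorder.read_input)
replayed_output = deterministic_app(ReplayShim(recorder.log).read_input)
```

Because the application is deterministic apart from its inputs, the log of external inputs alone is enough for the replayed run to match the recorded one.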

The following sections first describe how shims ob- 
serve and control the what (Figure 3a), and then how 
shims observe and control the when (Figure 3b), using 
record and replay as running examples. 
Figure 4. Observing a blocking system call


2.2.1 Interposing on Nondeterminism 


Shims use shim_trace to wait for a DPG to encounter 
nondeterminism. shim_trace blocks until either (a) a 
deterministic thread is about to perform a nondeterminis- 
tic system call, or (b) an external signal is about to be de- 
livered to a deterministic thread. In both cases, the deter- 
ministic thread stalls, execution transfers to the shim pro- 
gram, and shim_trace returns. The shim can interpose 
on this nondeterministic event and then return control 
back to the deterministic thread by calling shim_resume. 
In this way, execution of a deterministic thread alternates 
between itself and a shim program, much like execution 
of an ordinary thread alternates between user-space and 
kernel-space. 

For system calls, shim_trace populates the given
event structure with the system call number and arguments.
The shim should perform the system call on behalf
of the deterministic thread and then transfer control
back to the deterministic thread by calling shim_resume,
using the result parameter of shim_resume to specify the
system call’s return value. The shim might perform the 
call by forwarding the call to the OS (e.g., for record) or 
by ignoring the OS entirely (e.g., for replay). 

For external signals, the event structure includes the 
siginfo_t of the pending signal. The shim can queue 
the signal for delivery by calling shim_queue_sig, save
the signal internally for later delivery, or discard the 
signal entirely. In each case, the shim returns control to 
the deterministic thread by calling shim_resume with an 
empty result. 


2.2.2 Controlling Logical Time 


A shim program monitors the passage of logical time
in a DPG by registering logical time barriers using
shim_add_barrier. A logical time barrier is a timer tied
to a specific deterministic thread (through the tid parameter);
when the timer goes off, the deterministic thread
stalls and the shim is notified through shim_trace. The
barrier time is specified as an offset relative to the current
logical time of the DPG, which can be obtained with
shim_gettime. Time barriers can be used to control the
nondeterministic when, as described below.


Figure 5. Controlling a blocking system call


System Call Blocking Time Figures 4 and 5 illustrate 
how to observe and control the number of logical time 
steps that a system call blocks. Both examples follow a 
similar pattern; the only difference is the way in which 
shim_add_barrier is called. 

Figure 4 illustrates observing a blocking system call
(e.g., for record). When deterministic thread T performs
a system call (a), the call is trapped by the shim, which
returns from shim_trace. At this point, thread T stalls
and the DPG’s logical time does not advance. The shim
can now forward the call to the OS, but before doing
so it puts T to sleep by calling shim_sleep (b). While
T is asleep it is detached from the logical timeline and
does not execute; this allows a nondeterministic amount
of logical time to pass in the DPG while the system call
is being performed. When the system call finally completes,
the shim synchronizes with the DPG by registering
a time barrier for T to happen at the very next logical
time step in the DPG (c). Once that barrier triggers, the
shim returns control to T via shim_resume (d).

Figure 5 illustrates controlling a blocking system call
(e.g., for replay). Again, deterministic thread T performs
a system call which is trapped by the shim via
shim_trace (a). The key difference in this example is
that the shim decides, a priori, that the system call should
complete in exactly n logical time steps. For example, a
replay shim would read n from a log. To enforce this,
the shim registers a barrier for T that will trigger n steps
in the future and then puts T to sleep (b). While T is
asleep it does not execute; the rest of the DPG executes
normally for exactly n logical time steps, but no further.
At this point the barrier triggers: T wakes and notifies
the shim (c). Finally, the shim returns from the system
call and returns control to thread T (d).
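
The controlled case can be sketched as a toy model of the logical timeline. This is a simulation under stated assumptions, not the dOS kernel interface: the LogicalTimeline class and its methods are hypothetical and only mimic the roles of shim_add_barrier and shim_sleep, with one global counter standing for logical time.

```python
class LogicalTimeline:
    """Toy model: one global logical counter plus per-thread barriers."""

    def __init__(self):
        self.now = 0
        self.barriers = {}  # tid -> absolute logical time at which it fires
        self.asleep = set()

    def add_barrier(self, tid, offset):
        # As in the text, the barrier is an offset from the current time.
        self.barriers[tid] = self.now + offset

    def sleep(self, tid):
        self.asleep.add(tid)

    def step(self):
        """Advance logical time by one; return threads whose barrier fired."""
        self.now += 1
        fired = [t for t, when in self.barriers.items() if when == self.now]
        for t in fired:
            del self.barriers[t]
            self.asleep.discard(t)  # the thread wakes and notifies the shim
        return fired


# Replay: the log says the call made by thread "T" blocked for n = 3 steps.
timeline = LogicalTimeline()
timeline.add_barrier("T", 3)
timeline.sleep("T")
wakeups = [timeline.step() for _ in range(3)]
```

The rest of the group advances for exactly three logical steps while T sleeps, and T wakes on the third step regardless of how long those steps take in physical time.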


Signal Delivery Time Now suppose a shim wants to
deliver a signal to thread T at logical time n. To do
this, the shim should simply register a barrier for time
n. When that barrier is reached, the shim can queue the
signal for immediate delivery using shim_queue_sig
and then resume the thread using shim_resume.


2.2.3 Shim Use Cases


Shim programs can implement the record/replay and 
replicated execution services discussed earlier, but we 
envision many other kinds of shim programs as well. 
Some shims will be generic, application-independent 
services written by systems programmers, while others 
will be written by application programmers and tailored 
to enhance a specific application. Additionally, a shim 
program can be used to adjust the boundary of a deter- 
ministic box in two ways described below. 


Expanding the Set of Deterministic Services An OS 
that supports DPGs may decide to implement some sys- 
tem calls nondeterministically to reduce kernel complex- 
ity, even when deterministic implementations are possi- 
ble under the right assumptions. For example, in dOS, 
interaction with local files remains nondeterministic due 
to variations in disk latency, even though this nondeter- 
minism can be considered internal and thus eliminated 
under the right assumptions. Section 5.1 explores how a 
shim can make local file access deterministic. 

Further, a shim can virtualize global resources such 
as process identifiers in a deterministic way, as in [30]. 
A shim can even convert physical times (e.g., used by 
sleep and alarm) into virtual, logical times. This would 
eliminate nondeterminism introduced by real time, but 
of course is only meaningful for applications that do not 
require a precise correspondence with real time. 


Customizing the Nondeterministic Interface System 
calls are a DPG’s basic interface to the nondeterministic 
world. However, it is often beneficial to let applications 
define the nondeterministic interface at a more abstract 
level. For example, a server application might want to 
hide many low-level read and write system calls be- 
hind a single high-level, nondeterministic getmsg call. 
Previous work has argued that this flexibility is valuable 
for record/replay systems [19], but we consider this flex- 
ibility to be even more general; for example, Section 5.3 
shows how it is useful for replicated execution. 

We enable this flexibility in dOS by defining a new 
system call, dpg_callshim, which makes a direct call 
from a DPG into its shim. Effectively, dpg_callshim 
allows developers to divide an application into two parts: 
the deterministic part that runs in a DPG and the nonde- 
terministic part that runs in a shim. 


3. Deterministic Execution Algorithm 


The first implementation choice we make is which algorithm
dOS uses to enforce determinism. Prior work
on shared-memory determinism has proposed a family
of deterministic execution algorithms, including DMP-O,
DMP-B, and DMP-TM [6, 14]. dOS implements the
DMP-O ownership-tracking algorithm; we selected it for
its relative simplicity of implementation, but any deterministic
execution algorithm can support DPGs as long
as the shim layer can be implemented on top of it.


Figure 6. Timeline of a quantum round in DMP-O.
T1 finishes its quantum in parallel mode (a), while T2
and T3 have work left for serial mode (b,c).

One constraint imposed by the shim layer is that log- 
ical time should be representable with a single global 
counter. DMP-O, DMP-B, and DMP-TM all satisfy this
constraint, but other (as yet uninvented) algorithms may 
require a more complex notion of logical time, such as 
a vector clock. We believe the shim layer could be ex- 
tended to support such algorithms, but the details are left 
for future work. 

In the rest of this section, we first summarize our 
earlier work on using DMP-O to enforce deterministic 
execution of multithreaded programs that communicate 
via shared-memory. Next, we describe how to generalize 
DMP-O to include communication via channels other
than shared-memory, such as pipes and signals. 


3.1 Shared-Memory Determinism 


Two key observations underlie DMP-O. First, if threads 
do not touch shared data, i.e., if they do not commu- 
nicate, their execution will be deterministic no matter 
how they are scheduled. Second, when threads do com- 
municate, a trivial deterministic schedule is to divide 
each thread’s execution into chunks and then execute all 
chunks in a deterministic serial order. 

Following these observations, execution in DMP-O is 
divided into chunks called quanta. A round consists of 
all threads executing one quantum each. Each round is 
divided into a parallel mode and a serial mode. In paral- 
lel mode, threads run in parallel but are isolated; they do 
not communicate. In serial mode, threads run serially but 
can communicate arbitrarily. A thread ends its parallel 
mode once it has reached an instruction that might com- 
municate with other threads. Serial mode begins once all 
threads have completed parallel mode, and ends once all 
threads have had a chance to run. The parallel and serial 
modes are thus isolated by global barriers into two-stage 
rounds, as illustrated in Figure 6. 

Notice that the parallel and serial modes are directly
inspired by the two key observations stated above.
DMP-O is deterministic as long as threads are (1) broken
into quanta at deterministic boundaries, (2) ordered
deterministically in serial mode, and (3) correctly iso- 
lated in parallel mode. The first two constraints are easily 
satisfied: we define a quantum to be some deterministic 
number of dynamic instructions, and we order threads in 
serial mode by sorting them by creation order. 

DMP-O achieves isolation in parallel mode by partitioning
ownership of shared-memory across threads.
Each memory location is in one of two ownership states: 
owned-by-T for some thread T, or shared. A location
that is owned-by-T is private to T; no other thread can 
access the location during parallel mode. A location that 
is shared is globally read-only; all threads can read the 
location during parallel mode, but none can modify it. A 
thread waits for serial mode before performing an opera- 
tion that does not meet these conditions. 

Ownership states evolve during serial mode by following
two rules: (1) before thread T writes to a location,
it sets ownership of that location to owned-by-T; and (2)
before T reads a location that is not owned-by-T, it sets
ownership of that location to shared.
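
As an illustration only, the parallel-mode access conditions and the two serial-mode rules can be modeled in a few lines of Python. This is a sketch, not dOS code: the class and method names are hypothetical, and treating a location with no recorded owner as shared is an assumption made for the example.

```python
from dataclasses import dataclass, field

SHARED = "shared"  # sentinel for the globally read-only state


@dataclass
class OwnershipTable:
    """Illustrative model of DMP-O ownership states (hypothetical names)."""
    owner: dict = field(default_factory=dict)  # location -> thread id or SHARED

    def allowed_in_parallel(self, tid, loc, is_write):
        """True if `tid` may perform the access without waiting for serial mode."""
        state = self.owner.get(loc, SHARED)  # assumption: unknown means shared
        if state == tid:
            return True           # private to this thread: reads and writes allowed
        if state == SHARED:
            return not is_write   # shared locations are read-only in parallel mode
        return False              # owned by some other thread

    def access_in_serial(self, tid, loc, is_write):
        """Apply the two serial-mode rules before the access is performed."""
        if is_write:
            self.owner[loc] = tid        # rule (1): the writer takes ownership
        elif self.owner.get(loc, SHARED) != tid:
            self.owner[loc] = SHARED     # rule (2): a foreign read makes it shared
```

A thread for which `allowed_in_parallel` returns False would stall until its turn in serial mode and then call `access_in_serial`.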


Logical time Finally, we say that logical time incre- 
ments on every mode transition, i.e., on every transition 
from parallel mode to serial mode and back. Note that 
within a single mode every thread appears to execute 
atomically. From this property it follows that mode tran- 
sitions are meaningful increments of logical time. 


3.2 Beyond Shared-Memory Communication 


Our model of a DPG from Figure 1 is that threads com- 
municate by performing operations on shared kernel ob- 
jects, which includes more than just shared-memory. To 
generalize DMP-O to this model we first observe that 
we can track ownership of shared kernel objects just as 
for shared-memory locations: if an operation mutates a 
kernel object it acts as a “write,” while if an operation
only observes a kernel object it acts as a “read.” In fact, 
our implementation (Section 4.1.1) tracks ownership of 
shared-memory at the page granularity, effectively treat- 
ing a memory page as just another kernel object. 

To fully generalize DMP-O we need two additional 
changes: the first deals with blocking operations, and the 
second deals with asynchronously delivered signals. 


Blocking Operations When a system call blocks, the 
calling thread ends its current mode (either parallel mode 
or serial mode) and is not scheduled to run again until it 
unblocks. While a thread is blocked, the rest of the DPG 
continues to execute. A thread can only unblock during 
a mode transition; this ensures that threads unblock at 
discrete points on the logical timeline. 


Signal Delivery Incoming signals are queued during 
the current mode then delivered immediately on the next 
mode transition. Queued signals are partitioned into internal
and external signals, depending on whether they were sent
from a thread inside or outside the DPG, respectively. If
there are N threads in a DPG then each deterministic thread
has N logical queues: one queue for external signals and one
queue for signals sent from each of the N − 1 other
deterministic threads.

On a mode transition, internal signals are delivered 
first and external signals are delivered last. The internal 
signal queues are emptied in a deterministic order, e.g., 
by using the ID of the sending thread as a sort key. 

This strategy ensures, first, that internal signals are
delivered deterministically, and second, that external signals
are delivered at meaningful logical times. Note that
when a thread sends a signal to itself (as with SIGSEGV)
the signal is synchronous; such signals are always delivered
instantly. dOS implements the N-queue model described
here using a single sorted list. Additionally, dOS
always delivers external SIGKILL signals immediately
(rather than forwarding them to the shim) so that a DPG
can be killed even when its shim program misbehaves.
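
The delivery order described above can be sketched as a single sort, in the spirit of the "single sorted list" mentioned in the text. The sketch below is illustrative and not taken from dOS: the `EXTERNAL` marker and the tuple layout are hypothetical, and it assumes that signals from the same sender keep their send order.

```python
EXTERNAL = None  # hypothetical marker for a sender outside the DPG


def delivery_order(pending):
    """Order the signals queued for one thread during a mode.

    `pending` is a list of (sender_id, signo) tuples; sender_id is EXTERNAL
    for signals that came from outside the DPG.
    """
    internal = [s for s in pending if s[0] is not EXTERNAL]
    external = [s for s in pending if s[0] is EXTERNAL]
    # list.sort() is stable, so signals from one sender keep their relative order.
    internal.sort(key=lambda s: s[0])
    # Internal signals are delivered first, external signals last.
    return internal + external
```

For example, `delivery_order([(3, 10), (EXTERNAL, 15), (1, 12), (3, 11)])` places the signal from thread 1 first, then the two signals from thread 3, then the external signal.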


4. Linux-Based Implementation 


We now describe how we implemented dOS, which is a 
variant of Linux that implements the DPG abstraction. 
dOS makes two major changes to Linux: first, it imple- 
ments the shim layer; and second, it implements DMP-O, 
which includes an object ownership-tracking mechanism 
and a deterministic scheduler that constrains the execu- 
tion of each DPG to a deterministic logical timeline. dOS 
exports a traditional system call interface to DPGs along 
with the sys_makedet and dpg_callshim system calls. 

Our implementation of DMP-O was the most chal- 
lenging and invasive change. Overall, we added roughly 
5800 lines of new code to the Linux 2.6.24-7/x86-64 ker- 
nel and changed roughly 2500 lines of existing code in 53 
files. Below we summarize the low-level implementation 
details of dOS and discuss engineering challenges (Sec- 
tions 4.1-4.4). We end with a summary of the strengths 
and limitations of our implementation (Section 4.5). 


4.1 Ownership Tracking


4.1.1 Shared-Memory Pages


dOS tracks ownership of shared-memory at the page 
granularity by using hardware page-protection to verify 
that a deterministic thread does not access a page without 
appropriate ownership. 

Conceptually, dOS maintains a shadow page table 
for each thread. A thread’s shadow table mirrors its real 
page table exactly, except that shadow permission bits 
are modified to reflect the current distribution of page 
ownership. dOS exposes only the shadow page tables to 
hardware: on a context switch to thread T, dOS installs
T’s shadow table onto the CPU even if the previously
scheduled thread shared an address space with T.

Page ownership is encoded into shadow page table
permissions so that ownership violations such as a store 
to a shared page will trigger a page fault. dOS intercepts 
this page fault, notices it is due to an ownership violation, 
and stalls the faulting thread until it is scheduled to run 
in serial mode. dOS then assigns ownership of the page 
to the faulting thread and continues its execution. 

Every conventional process has one real page table 
representing its address space. All address space modifi- 
cations are expressed in terms of the real page table and 
then transparently applied to the shadows. To limit mem- 
ory overheads, dOS maintains just N shadow tables per 
address space, where N is the number of CPUs, and then 
assigns threads to shadow tables, effectively bucketing 
the threads in a given process into N ownership groups. 
This requires a slight tweak to the DMP-O scheduler: 
during parallel mode, all threads that share a shadow 
table must be serialized in a deterministic order (e.g., 
scheduled serially in thread creation order). We bucket 
threads using a simple greedy algorithm. 
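
The text does not spell out the greedy bucketing algorithm. One plausible reading, shown here purely as an assumption, assigns each thread in creation order to the shadow table that currently holds the fewest threads, breaking ties by table index. The function name and the use of thread IDs as a stand-in for creation order are hypothetical.

```python
def bucket_threads(thread_ids, num_tables):
    """Assign threads to `num_tables` ownership groups, least-loaded first.

    Assumption: sorting thread ids reproduces thread creation order.
    Returns a list of buckets; threads within a bucket are in creation order,
    which is also the order in which they would be serialized in parallel mode.
    """
    buckets = [[] for _ in range(num_tables)]
    for tid in sorted(thread_ids):
        # Pick the smallest bucket; ties go to the lowest table index so the
        # assignment is a deterministic function of the inputs.
        target = min(range(num_tables), key=lambda i: (len(buckets[i]), i))
        buckets[target].append(tid)
    return buckets
```

Because the assignment depends only on the set of thread IDs and the number of tables, it is the same on every run with the same inputs.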

This strategy is not limited to shared-memory within 
a single process. dOS supports shared-memory across 
processes by tracking ownership of physical pages; we 
use Linux’s rmap facility to enumerate all user-space 
addresses that map a given physical page. 

Finally, there are two corner cases worth mentioning.
First, dOS disables address space randomization for
DPGs so that every DPG has a deterministic address 
space layout. Note that we can enable address space ran- 
domization in DPGs if we expose the seed as external 
nondeterminism. Second, page swapping can introduce 
nondeterministic changes to page tables. To preserve de- 
terminism, when a page is swapped out, dOS preserves 
the page’s ownership state using extra bits in the shadow 
page tables. When a page fault triggers a swap-in, dOS 
stalls the thread until the page is read from disk, and then 
restores the saved ownership state of the page. 


4.1.2 Other Kernel Objects 


Other kernel objects, such as pipes and sockets, are op- 
erated on by system calls. dOS instruments the kernel so 
that a system call never operates on a kernel object unless 
the calling thread has the appropriate level of ownership. 

Adding this instrumentation presents two engineering 
challenges. First, where should the instrumentation be 
placed? It is tempting to lazily acquire ownership of an 
object just before a system call actually uses the object, 
but doing this requires reengineering kernel locking pro- 
tocols. To see why, note that acquiring ownership may re- 
quire sleeping the calling thread to wait for serial mode. 
However, a system call may not decide to use an object 
until inside an atomic region, e.g., while holding a spin 
lock, and it is not safe to sleep in such regions. 

dOS avoids this difficulty by conservatively acquiring
ownership of all objects a system call may use before
executing the call. This requires adding instrumentation
in just two places: at the system call entry point, and in
the code that wakes up a thread. dOS instruments thread
wakeup to reacquire any privileges lost while the system
call was asleep, e.g., while waiting for input.

Table 1. System call behaviors

The second challenge is that Linux is a large, com- 
plex system with over 250 system calls and many unique 
types of kernel objects. To simplify our implementation, 
we track a few kinds of kernel objects precisely and then 
conservatively merge all other kinds of objects into an 
untracked objects group. For all but the untracked ob- 
jects, dOS tracks ownership using a hash table that maps 
an object to its current owner. Freshly allocated objects 
are initially owned-by the allocating thread. Ownership 
of the untracked objects is implicit: during parallel mode 
they are shared; and during serial mode they are owned- 
by the thread currently running. Thus, read-only opera- 
tions on untracked objects can execute in parallel mode, 
while all other operations on untracked objects must wait 
for serial mode. This strategy is summarized in Table 1. 
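
A minimal sketch of this two-tier scheme follows. It is not dOS code: the set of precisely tracked kinds (`TRACKED_KINDS`), the class name, and the default state for a tracked object with no recorded owner are all assumptions made for illustration.

```python
TRACKED_KINDS = {"page", "inode"}  # hypothetical: kinds tracked precisely


class ObjectOwnership:
    """Hash table for tracked objects plus an implicit 'untracked' group."""

    def __init__(self):
        self.owner = {}  # tracked object -> owning thread id, or "shared"

    def allocate(self, tid, obj):
        # Freshly allocated objects start out owned by the allocating thread.
        self.owner[obj] = tid

    def may_run_in_parallel(self, tid, kind, obj, read_only):
        """True if the operation can proceed without waiting for serial mode."""
        if kind not in TRACKED_KINDS:
            # Untracked objects are implicitly shared during parallel mode,
            # so only read-only operations may proceed.
            return read_only
        state = self.owner.get(obj, "shared")  # assumption: default is shared
        return state == tid or (state == "shared" and read_only)
```

In serial mode every operation is permitted, since the untracked group is then implicitly owned by the running thread and tracked objects can change owner.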

An inode is Linux’s internal name for files, sockets, 
pipes, and anything else that can be referenced by a file 
descriptor. System calls like read that operate on file 
descriptors can modify the contents of memory pages, 
map new pages into the address space, or even modify the 
inode itself. These system calls must acquire ownership 
of all of these objects before proceeding. 


4.2 Scheduling 


The dOS scheduler is implemented as a filter in front of 
the default Linux scheduler—it does not push a deter- 
ministic thread into the Linux scheduler until the thread 
has been scheduled to run by its DPG. This filter imple- 
ments the DMP-O scheduling algorithm. 


Thread Creation The fork and clone system calls al- 
ways execute in serial mode. This ensures that determin- 
istic threads are spawned in a global serial order. The 
newly spawned thread will be scheduled to run during 
the next parallel mode. 


Logical Time Barriers The dOS scheduler checks for 
pending time barriers on each mode transition. To pre- 
vent deadlock, dOS instantly fast-forwards logical time 
to the next pending time barrier whenever all threads in 
a DPG are simultaneously asleep.


Quantum Formation Recall that parallel mode ends
when all threads have either reached a quantum bound- 
ary or stalled to acquire ownership, and serial mode 
ends once all threads have reached a quantum boundary, 
where quantum boundaries must occur at deterministic 
points in a thread’s execution. A possible implementa- 
tion is to mark quantum boundaries with system calls, 
but this does not guarantee forward progress because 
a thread may loop forever without making any system 
calls. Additionally it does not guarantee balance; im- 
balance leads to excessive waiting at the end of parallel 
mode, which leads to poor performance [6]. 

To guarantee forward progress, dOS defines a quan- 
tum budget, which is the maximum amount of work a 
thread can perform in a quantum. dOS estimates work 
by counting instructions. The quantum budget is simply 
a deterministic number of instructions, typically in the 
range of tens to hundreds of thousands of instructions. 

dOS counts instructions using the hardware “instruc- 
tions retired” counter that is available on all modern x86 
CPUs. dOS configures this counter to trigger an overflow 
interrupt after the quantum budget expires. There are 
well-documented caveats about using this counter [15, 
43]. Specifically, the counter suffers from nondetermin- 
ism that can be engineered around. We follow the so- 
lution outlined by [15]: to overcome imprecise interrupt 
delivery, dOS must single-step the DPG (via the x86 trap 
flag) for up to about 200 instructions per quantum, which 
can introduce large overheads. To avoid those overheads, 
as an optimization, dOS deterministically ends a quan- 
tum when returning from a system call if the remaining 
quantum budget is low, but not yet exhausted. 
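
The quantum-ending decision can be summarized as a small predicate. This is a sketch under stated assumptions: the function name and the `low_water` threshold are hypothetical, since the text does not say how low the remaining budget must be before a quantum is ended early.

```python
def should_end_quantum(retired, budget, at_syscall_return, low_water=1_000):
    """Decide whether the current quantum ends at this point.

    retired:           instructions retired so far in this quantum
    budget:            deterministic quantum budget, in instructions
    at_syscall_return: True when the thread is returning from a system call
    low_water:         hypothetical threshold for "budget is low"
    """
    remaining = budget - retired
    if remaining <= 0:
        return True  # budget exhausted: the counter-overflow path
    # Ending early at a system call return avoids single-stepping later.
    return at_syscall_return and remaining <= low_water
```

Both inputs to the decision (the instruction count and the system call return point) are deterministic, so ending the quantum early does not disturb the logical timeline.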


4.3 Additional Optimizations 


As demonstrated in [6], DMP-O performs best when parallel
mode is balanced and when serial mode is empty.
dOS implements a few optimizations to bias execution 
towards these conditions. dOS automatically adjusts a 
DPG’s quantum budget: when dOS detects significant 
parallel mode imbalance, the budget is decreased to re- 
duce imbalance, and when dOS detects well-balanced 
parallel modes, the budget is increased to reduce quan- 
tum barrier overheads. To limit the time spent executing 
in serial mode, dOS ends a quantum after a few (heuris- 
tically determined) ownership transfers. All of these op- 
timizations preserve determinism, since the parameters 
used evolve deterministically. 
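
One way such a budget adjustment could look is sketched below. The halving and doubling policy, the bounds, and the imbalance test are assumptions chosen for illustration; the text states only that the budget shrinks under imbalance and grows when parallel modes are well balanced.

```python
def next_budget(budget, parallel_work, lo=10_000, hi=200_000):
    """Return the quantum budget for the next round.

    parallel_work: instructions each thread executed in the last parallel
                   mode (deterministic counts, so the result is deterministic).
    lo, hi:        hypothetical bounds on the budget.
    """
    longest = max(parallel_work)
    # Integer-only test for "mean work is under half of the longest thread",
    # which avoids any floating-point rounding concerns.
    imbalanced = 2 * sum(parallel_work) < longest * len(parallel_work)
    if imbalanced:
        return max(lo, budget // 2)  # shrink to reduce waiting at the barrier
    return min(hi, budget * 2)       # grow to amortize barrier overheads
```

For instance, `next_budget(50_000, [50_000, 1_000, 1_000, 1_000])` halves the budget, while `next_budget(50_000, [50_000, 48_000])` doubles it.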


4.4 Shim Programs 


In concrete terms, a shim program is composed of a 
collection of threads called shim threads. Shim threads 
begin life as ordinary user-space threads, e.g., after being 
spawned by fork or clone. An ordinary thread becomes 
a shim thread by calling shim_attach to attach to some 
deterministic thread T. Once attached to T, the shim
thread is the distinguished thread that will intercept all
of T’s nondeterminism through shim_trace. If the shim
thread crashes, T will stall on external operations until
attached to by another shim thread. 

A thread can act as a shim thread for more than one 
deterministic thread. Additionally, to simplify the imple- 
mentation of shims, a shim thread can elect to receive 
only the nondeterministic system calls or only the exter- 
nal signals for a given deterministic thread (by setting the 
second parameter of shim_attach). Our usual strategy 
is to spawn one shim process for every DPG. Within this 
process we spawn one shim thread to intercept signals for 
the entire DPG, and for every deterministic thread in the 
DPG we spawn one shim thread to interpose on the sys- 
tem calls performed by the corresponding DPG thread. 


Intercepting System Calls When a shim program inter- 
cepts a system call it has two options: (1) it can emulate 
the system call completely; or (2) it can simply instru- 
ment the system call’s entry and exit, allowing the deter- 
ministic thread to actually execute the body of the system 
call. These options resemble those allowed by ptrace. 
The option to simply instrument a system call is se- 
lected by passing a special result to shim_resume. This 
option gives a shim limited control over how the system 
call executes in logical time. For example, if a shim sim- 
ply instruments read instead of emulating it, the shim 
cannot observe or control when the kernel writes to the 
given user-space buffer (the writes will happen nonde- 
terministically, in an unrecordable way). We provide in- 
strumentation as a convenience for cases where full em- 
ulation is not necessary. During system call emulation, a 
shim can use shim_ctl to perform side effects in a DPG,
such as writing to or reading from a user-space buffer. 


RDTSC dOS allows shim programs to interpose on the
nondeterministic RDTSC instruction. Our implementation 
uses the time stamp disable flag of the x86 cr4 register 
to fault on user-mode accesses to RDTSC; these events are 
exposed to the shim via shim_trace. 


4.5 Discussion 


Guarantees Provided by dOS dOS guarantees that
communication via the following kernel objects is de- 
terministic as long as the objects are completely inter- 
nal to a given DPG: shared-memory pages, including 
across multiple processes; pipes allocated with pipe; 
and futexes (used to implement pthreads synchroniza- 
tion). Additionally, dOS guarantees that file descriptors 
and memory pages are allocated in a deterministic order; 
that the address space evolves deterministically (as via 
mmap); that internal signals are delivered deterministi- 
cally; and that wait is deterministically notified when 
threads in the same DPG exit. 

Note that some system calls are deterministic except
in error cases. For example, mmap allocates pages
deterministically within an address space, but will fail
nondeterministically if there is not enough physical memory
available to service the request.


Guarantees Not Provided by dOS Our deterministic 
guarantees may not translate across different versions of 
program binaries no matter how slightly different (e.g., 
after a patch). Also, although our guarantees hold across 
different host machines, an application can read host con- 
figuration as part of its inputs, for example to dynami- 
cally adjust its resource usage; these inputs must be du- 
plicated exactly to guarantee determinism. 

Additionally, dOS does not guarantee deterministic 
access to shared-memory pages that can be modified by 
threads or devices outside the DPG. Ideally we might in- 
terpose on this external communication using the shim, 
but this would require adding excessive restrictions to 
non-DPG processes. For example, page ownership might 
transition between “exclusive to a DPG” and “exclusive 
to the external world,” but this would require stalling ex- 
ternal threads as they wait to reacquire page ownership. 
Relatedly, DPGs may encounter nondeterminism when 
memory is modified through backdoors in /proc. 


Retrospective Implementing DPGs in a monolithic ker- 
nel such as Linux raises many thorny issues. The exam- 
ple of mmap is instructive: reasoning about the cases in 
which mmap is nondeterministic requires finding and rea- 
soning about many code paths in a monolithic kernel. 

More generally, providing determinism requires track- 
ing and mediating accesses to shared OS objects. How- 
ever, many Linux kernel objects have aliased names, 
are named in multiple namespaces, and are accessible 
through multiple interfaces. For example, process IDs are 
exposed through system calls, the /proc filesystem in- 
terface, and in some cases, thread-local storage variables 
in the address space of a multithreaded process. If we 
consider PIDs to be a source of internal nondeterminism, 
dOS must correctly track and reconcile PIDs through all 
of these channels, for instance, by virtualizing PID numbers
before they are exposed to a program so that PID
assignment is deterministic and consistent across processes
within a DPG. Even if we consider PIDs a source of
external nondeterminism (the choice made by dOS), for
record/replay to work correctly a shim program must in- 
terpose on all of these different channels for accessing 
PIDs, so that PIDs can be recorded and during replay the 
same PIDs can be reassigned. 

An OS kernel implemented “from scratch” to support 
DPGs would benefit from design principles advocated 
by exokernels and microkernels. A minimal kernel in- 
terface combined with a libOS would push many of the 
aliased interfaces and complex code paths out of the ker- 
nel and inside the user-space deterministic box, making 
it easier to reason about determinism at the system call 
layer. The protection domains of a microkernel could further
simplify many of these issues, since reasoning about
nondeterminism would largely reduce to detecting mes- 
sages that cross the boundary of a deterministic box. In 
the mmap example, this might be a message to the page- 
allocation server. 


5. Shim Applications 


To demonstrate the usefulness of the shim layer, we 
have implemented three shims: deterministic filesystem 
services; record/replay by logging just external input; 
and replicated execution of a multithreaded server. The 
deterministic filesystem service and record/replay shims 
can be used with unmodified application binaries, while 
the replicated execution shim is application specific. We 
note that the shim layer allowed us to quickly prototype 
the shims described in this section. 


5.1 Deterministic Filesystem Services 


FSSHIM provides applications with a deterministic file 
hierarchy. All reads and writes to files within this hierar- 
chy are deterministic; accesses to files outside of this hi- 
erarchy are considered sources of external nondetermin- 
ism, as before. There are two sources of nondeterminism 
FSSHIM must eliminate: the latency of each operation, 
and the number of bytes operated on by the read and 
write system calls. FSSHIM eliminates the first by de- 
ciding, a priori, that each operation will block for a fixed 
and deterministic amount of logical time. For the second, 
FSSHIM guarantees that all reads and writes operate 
on a deterministic number of bytes by always performing 
the maximum amount of work requested (up to an end- 
of-file, for reads). FSSHIM can make these guarantees 
because it performs the read and write calls on behalf 
of the DPG, using the pattern illustrated in Figure 5. 
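
The "always perform the maximum amount of work requested" rule for reads corresponds to the familiar read-until-full loop. The sketch below shows that loop in Python using `os.read`; it illustrates the technique only and is not FSSHIM code, and the helper name `read_fully` is hypothetical.

```python
import os


def read_fully(fd, count):
    """Read exactly `count` bytes from `fd`, stopping early only at end-of-file."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = os.read(fd, remaining)  # may return fewer bytes than requested
        if not chunk:                   # an empty result means end-of-file
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

With this loop the number of bytes returned depends only on the request size and the file length, not on how the underlying reads happened to be split.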

The deterministic blocking time selected by FSSHIM 
can affect performance. For example, if FSSHIM selects 
a logical blocking time that is too low, the DPG will stall 
waiting for disk operations to complete. On the other 
hand, if FSSHIM selects a time that is too high, the call- 
ing thread will execute artificially slowly. The logical 
blocking times we chose for FSSHIM are equivalent to 
a delay of about 5 million instructions; we did not exper- 
iment heavily with this number. 

A file can exist in the deterministic file hierarchy only 
if it can be considered internal to the DPG, which is true 
when: (1) the initial contents of the file are deterministic; 
(2) the file is not written by any threads outside the DPG; 
and (3) operations on that file complete in a finite time. In 
practice, the third assumption implies fail-stop. FSSHIM 
relies on the user to explicitly indicate the parts of the 
filesystem for which these assumptions are valid. This 
typically includes the directories containing program in- 
puts, as well as directories shared system-wide that are 
rarely updated, such as /usr and /etc.


5.2 Record/Replay


RECSHIM records all external nondeterminism intro- 
duced through the system call interface and signals, en- 
abling deterministic program replay. RECSHIM needs to 
record only the external nondeterminism because DPGs 
eliminate all forms of internal nondeterminism. Further, 
RECSHIM can be combined with FSSHIM, reducing 
what needs to be logged since accesses to files within 
the deterministic file hierarchy would be deterministic. 

RECSHIM utilizes the shim layer to interpose on sys- 
tem calls and to intercept external signals. System calls 
that touch user-space memory are executed by RECSHIM 
on behalf of the DPG. RECSHIM produces a log con- 
taining the logical time the event occurred and any other 
event-specific information needed during replay. For sys- 
tem calls, this includes the return value and logical block- 
ing time, as well as any side-effects of the system call, 
such as the contents of a buffer after performing a socket 
read. For signals, a copy of the siginfo is saved. Logs 
are compressed on-the-fly with zlib. 
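
As a rough illustration of such a log, the sketch below appends one record per event, each tagged with its logical time, and compresses the stream incrementally with Python's `zlib` module. The record layout, the JSON encoding, and all names are assumptions made for the example; the text does not describe RECSHIM's on-disk format.

```python
import json
import zlib


class EventLog:
    """Append-only event log compressed on the fly (hypothetical format)."""

    def __init__(self):
        self._z = zlib.compressobj()
        self._out = bytearray()

    def record(self, logical_time, kind, **details):
        # One JSON object per line: the logical time, the event kind, and any
        # event-specific fields (return value, blocking time, buffer, ...).
        line = json.dumps({"t": logical_time, "kind": kind, **details},
                          sort_keys=True)
        self._out += self._z.compress(line.encode() + b"\n")

    def close(self):
        # Flush whatever the compressor is still buffering.
        self._out += self._z.flush()
        return bytes(self._out)


def read_log(blob):
    """Decompress a finished log and return its events in logical-time order."""
    return [json.loads(line) for line in zlib.decompress(blob).splitlines()]
```

A system call record might be written as `log.record(42, "syscall", name="read", ret=5, blocked_for=3, buf=b"hello".hex())`, with the buffer contents hex-encoded so that they fit in JSON.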

We have implemented a proof-of-concept replay shim 
to verify that the shim layer offers all the hooks necessary 
to implement a replay component. The major challenges 
in faithfully replaying system call traces are orthogonal 
to the main body of our work and have been explored by 
prior work [19, 36, 37]. 


5.3 Replicated Execution


REPLICASHIM supports replication of a multithreaded 
webserver running inside a DPG by guaranteeing that 
the order of messages and their logical arrival time is 
kept consistent across all replicas. Given the same inputs 
and the same logical arrival times, the DPG abstraction 
guarantees that all replicas will evolve deterministically. 

Our target application is nullhttpd [12], a small, 
simple, multithreaded webserver that uses a thread-per- 
request model. Our design splits the functionality of the 
basic server into three separate process types: a single 
arbiter process, and a set of replicas, each composed of a 
shim process along with a DPG that hosts nullhttpd. 

The arbiter process operates nondeterministically, 
outside of any DPG, and accepts incoming HTTP re- 
quests from the network. The arbiter broadcasts requests 
to the replicated shims, which queue the requests locally. 
We modified nullhttpd to read new requests by mak- 
ing a direct call to its shim via dpg_callshim, rather 
than reading from the network. This shows a case where 
the programmer defines the interface via which nonde- 
terministic inputs are received. 

When the arbiter broadcasts a request, it must ensure 
that all replicas see that request at the same logical time. 
It does this by performing a two-phase commit to deter- 
mine a logical time that no replica has advanced beyond. 
The protocol works as follows. When the arbiter receives
a new HTTP request from the network, it asks all replicated
shims to set a barrier and report their current logical
time. The arbiter uses the maximum value reported by
any replica as the logical arrival time of the new request. 
The arbiter then broadcasts the new request and asks each 
replica’s shim to set a second barrier for this arrival time; 
once this barrier is reached at a replica, the replica’s shim 
makes the new request available to nullhttpd and the 
replicas continue to evolve deterministically. 
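
The two phases of this protocol can be simulated with a toy model. The sketch below is a simplification and not REPLICASHIM code: replicas are plain in-process objects rather than separate processes, logical time is an integer, and all class and method names are hypothetical.

```python
class ReplicaShim:
    """Toy stand-in for one replica's shim (all names are hypothetical)."""

    def __init__(self, now):
        self.now = now        # current logical time of this replica's DPG
        self.barrier = None   # logical time the DPG may not run past
        self.pending = []     # (arrival_time, request) pairs not yet visible

    def set_barrier_and_report(self):
        # Phase 1: stop the DPG from advancing, then report its logical time.
        self.barrier = self.now
        return self.now

    def commit(self, request, arrival_time):
        # Phase 2: queue the request and set a barrier at the agreed time.
        self.pending.append((arrival_time, request))
        self.barrier = arrival_time

    def advance(self, ticks):
        # Run the DPG forward, but never past the barrier.
        target = self.now + ticks
        if self.barrier is not None:
            target = min(target, self.barrier)
        self.now = target
        ready = [p for p in self.pending if p[0] <= self.now]
        if ready:
            self.pending = [p for p in self.pending if p[0] > self.now]
            self.barrier = None  # barrier reached: release the DPG
        return ready


def broadcast(replicas, request):
    """Arbiter side: choose a time no replica has passed, then commit to it."""
    arrival = max(r.set_barrier_and_report() for r in replicas)
    for r in replicas:
        r.commit(request, arrival)
    return arrival
```

Taking the maximum of the reported times works because every replica is held at its phase 1 barrier, so none can have advanced beyond the chosen arrival time when phase 2 begins.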


6. Evaluation 


The goal of our evaluation is to understand the perfor- 
mance of DPGs in comparison to ordinary nondetermin- 
istic execution (Nondet). We include evaluations of the 
three shim programs we built, namely FSSHIM, REC- 
SHIM, and REPLICASHIM. 


Correctness We tested our dOS implementation by run- 
ning the racey [20, 45] deterministic stress test 500 
times and verifying that racey always produces the same 
output. In addition to the basic racey program, we tested 
racey variants that exercise the various components of 
our implementation, such as communication via pipes, 
signals, and multiprocess shared-memory. 


Workloads We evaluated the following parallel work- 
loads: the PARSEC [8] and SPLASH2 [44] benchmark 
suites; pbzip2 [18] to compress a Linux ISO image; 
and make -j to perform a parallel build of the Linux 
kernel. PARSEC and SPLASH2 are workloads optimized
for parallelism; we scaled their inputs to run for
about a minute with a single nondeterministic thread. 
We present a representative subset of the PARSEC and 
SPLASH2 benchmarks that was selected to showcase both 
the best-case and worst-case performance of dOS. 

We also evaluated three reactive applications: the 
Apache and nullhttpd webservers and the Chromium 
web browser. Apache and Chromium are especially in- 
teresting because they use multiple processes with mul- 
tiple threads per process. We evaluated the webservers 
using httperf [26] to simulate a constant stream of re- 
quests for static pages. We evaluated Chromium with 
two experiments: first, we measured the load time of 
nytimes.com (without any local caching); and second, 
we used Chromium’s debugging facilities to execute a 
scripted user session that opened 5 tabs and navigated 
to 12 URLs in rapid succession. All Chromium experi- 
ments used the process-per-tab model [34]. 

We ran our experiments on 8-core 2.8GHz Intel Xeon 
E5462 machines with 10GB of RAM using rundet, a 
small utility that constructs a single DPG and then exe- 
cutes an unmodified application binary inside that DPG. 
We used a relatively aggressive machine configuration to 
adequately explore the scalability of our parallel workloads.
All results shown are the average over ten executions,
with the highest and lowest values removed.


                     Config                 Throughput
                   Num   Threads              DPG     DPG +
Benchmark          Proc  per Proc   Nondet    only    FSSHIM
apache 10KB         16      1       10.1K     3.6K    1.7K   req/s
apache 10KB          4      4       10.1K     6.6K    2.2K   req/s
apache 10KB          1     16       10.2K     7.4K    2.4K   req/s
apache 100KB         4      8        1.1K     1.1K    0.9K   req/s
nullhttpd 10KB       1     16        10K      1.0K    1.0K   req/s
chromium nytimes                     1.8s     2.4s    3.8s
chromium scripted                    22s      37s     40s

Table 2. Reactive Workload Evaluation


                     Overheads (relative to Nondet)            Speedup
                                                           (8-th over 2-th)
                   DPG Only             DPG + FSSHIM
Benchmark       2-th  4-th  8-th     2-th  4-th  8-th      Nondet  FSSHIM
blackscholes     1.2   1.2   1.3      1.2   1.3   1.3        3.4     3.2
dedup            2.3   3.6   4.0      4.0   5.8   6.4        1.6     1.0
fmm              2.6   6.1  10.1      2.6   6.0  10.1        2.4     0.6
lu               2.0   2.3   2.3      2.0   2.3   2.3        2.1     1.7
pbzip2           2.0   2.7   3.0      2.1   2.8   3.4        2.6     1.6
make             2.3   4.1   5.9      3.2   5.7   8.2        2.8     1.1

Table 3. Parallel Workload Evaluation


6.1 DPG Overheads 


We start with two questions: what are the overheads of 
DPGs for typical workloads, relative to nondeterministic 
execution, and how much overhead is added by FSSHIM?
To answer these, we ran our workloads in DPGs with no 
shim attached and in DPGs with FSSHIM attached. 

Table 2 summarizes this evaluation for reactive work- 
loads. The first few rows evaluate Apache for workloads 
of 10KB and 100KB static pages. For the 100KB workload,
both the Nondet and the DPG-only case are able
to saturate the gigabit network of the Apache server, in 
spite of the extra overhead of using the DPG. FSSHIM 
adds some additional overhead, enough to shift the sys- 
tem bottleneck to the CPU. 

For the 10KB workload, the Nondet case is still able
to saturate the network link. However, this workload 
involves a significantly higher rate of system calls and 
other nondeterministic events; each system call incurs a 
context switch from the DPG to its shim. As a result, 
both the DPG-only and the FSSHIM cases experience
serialization and overhead that slow the request rate
by between 1.4x and 5.9x.
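
The 1.4x to 5.9x range can be recomputed from the Apache 10KB rows of Table 2. The Python sketch below is illustrative arithmetic only: the dictionary and function names are hypothetical, and the throughput values are transcribed from the table rather than measured.

```python
# Throughput in requests/s for the Apache 10KB rows of Table 2,
# stored as (nondet, dpg_only, dpg_with_fsshim). Values are transcribed
# from the table; the names here are illustrative, not part of dOS.
APACHE_10KB = {
    "16 proc x 1 thread": (10_100, 3_600, 1_700),
    "4 proc x 4 threads": (10_100, 6_600, 2_200),
    "1 proc x 16 threads": (10_200, 7_400, 2_400),
}


def slowdowns(rows):
    """Return every Nondet/DPG-only and Nondet/FSSHIM throughput ratio."""
    ratios = []
    for nondet, dpg_only, dpg_fsshim in rows.values():
        ratios.append(nondet / dpg_only)
        ratios.append(nondet / dpg_fsshim)
    return ratios


ratios = slowdowns(APACHE_10KB)
print(f"slowdown range: {min(ratios):.1f}x to {max(ratios):.1f}x")
```

The smallest ratio comes from the single-process, 16-thread DPG-only run and the largest from the 16-process run with FSSHIM attached.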

Throughput generally decreases as the number of pro- 
cesses (Column 2) increases. We suspect this is because 
interprocess communication is more costly when exe- 
cuting in a DPG. Note that scaling can be achieved by 
running multiple smaller instances of Apache in separate 
DPGs. Overall, we consider these throughputs reason- 
able for all but the most high-traffic web sites. 

The last two rows show the execution time of Chromium.
For the scripted session, latency increases by 1.7x
for DPGs alone and by 1.8x for DPGs with FSSHIM.
Latency increases from 1.8 seconds to just 2.4 seconds
when loading nytimes.com. We also performed this test
for a Google search results page (not shown). All execution
times in the Google search results test were less than
a second, and informally, the differences “felt” negligible
when we interacted with the browser.

                    Config                Exec Breakdown                     Serialization Reasons
Benchmark           Num Proc  Num Thread  % Serial Mode  % Single Stepping   % Pgfault  % Syscall
apache 10KB         16        1           12%            < 1%                < 1%       99%
apache 10KB         4         4           80%            < 1%                < 1%       99%
apache 10KB         1         16          82%            < 1%                < 1%       99%
apache 100KB        4         8           26%            < 1%                2%         98%
nullhttpd 10KB      1         16          11%            0%                  2%         98%
chromium nytimes                          58%            13%                 61%        39%
chromium scripted                         25%            13%                 72%        28%
blackscholes        1         8           3%             27%                 99%        1%
dedup               1         8           54%            12%                 77%        23%
fmm                 1         8           90%            18%                 100%       0%
lu                  1         8           45%            35%                 95%        5%
pbzip2              1         8           35%            39%                 100%       0%
make                8         1           79%            3%                  0%         100%

Table 4. DPG Execution Characterization

Table 3 shows execution overheads for our parallel 
workloads with 2, 4, and 8 threads. Overheads are generally
below 3x, often lower than 2.5x. Columns 5-7
show the added cost of FSSHIM, which is typically small, 
since most of the applications do not perform a signifi- 
cant number of system calls (except dedup and make). 

DPG scalability is closer to Nondet scalability when 
the overheads do not grow much with the number of 
threads. Scalability suffers for workloads like fmm that 
share frequently at finer than page-level granularity, but 
blackscholes, which does not have fine-grained shar- 
ing, has DPG scalability very close to Nondet. 
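
The link between overhead growth and scalability can be checked against Table 3. The sketch below assumes that each overhead is the ratio of DPG execution time to Nondet execution time at the same thread count; under that assumption the DPG speedup equals the Nondet speedup scaled by overhead(2)/overhead(8). The names are hypothetical and the numbers are the DPG + FSSHIM columns as we read them from the table, with decimal points restored.

```python
# Per benchmark: (overhead at 2 threads, overhead at 8 threads,
#                 Nondet speedup, FSSHIM speedup reported in Table 3).
# Values are transcribed from Table 3; names are illustrative only.
TABLE3_FSSHIM = {
    "blackscholes": (1.2, 1.3, 3.4, 3.2),
    "dedup": (4.0, 6.4, 1.6, 1.0),
    "fmm": (2.6, 10.1, 2.4, 0.6),
    "lu": (2.0, 2.3, 2.1, 1.7),
    "pbzip2": (2.1, 3.4, 2.6, 1.6),
    "make": (3.2, 8.2, 2.8, 1.1),
}


def dpg_speedup(overhead_2, overhead_8, nondet_speedup):
    """Predict the 8-thread-over-2-thread speedup under DPGs.

    Assumes T_dpg(n) = overhead(n) * T_nondet(n), so that
    T_dpg(2) / T_dpg(8) = nondet_speedup * overhead(2) / overhead(8).
    """
    return nondet_speedup * overhead_2 / overhead_8


for name, (o2, o8, nondet, reported) in TABLE3_FSSHIM.items():
    predicted = dpg_speedup(o2, o8, nondet)
    print(f"{name:13s} predicted {predicted:.2f}  reported {reported:.1f}")
```

The predictions agree with the reported column to within the rounding of the table entries, which is consistent with blackscholes (overhead nearly flat) scaling well and fmm (overhead growing roughly fourfold) slowing down as threads are added.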


Characterization. Table 4 characterizes execution with
DPGs with FSSHIM attached. Column 4 shows the frac- 
tion of time execution was serialized (i.e., in serial 
mode). As expected, for the parallel workloads, serializa- 
tion is highly correlated with overheads and scalability. 
blackscholes and fmm are good comparison points; 
blackscholes is 3% serialized and scales nearly ide- 
ally with DPGs, while fmm is 90% serialized and has 
poor scalability. For the reactive workloads, the rela- 
tionship between serialization and performance is less 
clear, as shim context switch overhead and quantum imbalance
are also important factors. The rightmost set of
columns shows the reasons for serialization, broken down
into ownership page faults (Column 6) and system calls 
(Column 7). In reactive workloads, most serialization 
happens due to system calls, which is expected because 
reactive workloads perform frequent I/O. Conversely, 
for parallel workloads (except make), most serialization 
is due to ownership page faults. Also, because dOS
tracks ownership at page granularity, false sharing
can cause unnecessary serialization.

Even though serialization is very low in blackscholes,
the overheads are still on the order of 30%, largely because
of single-stepping. Column 5 shows the fraction
of execution during which at least one thread is single-stepping;
this varies from 0% to 39%. One interesting
trend is that reactive applications single-step less often;
these applications perform system calls frequently,
which triggers an optimization to end quanta early (Section 4.2).
Note that single-stepping does not necessarily
correlate with performance because serialization and
quantum imbalance dominate. In addition to the data shown
here, we measured the increase in frequency of total page
fault events due to ownership changes. While the frequency
is often higher, it was not directly correlated
with performance. Serialization, quantum imbalance,
and single-stepping are the dominant factors.

                Overheads               Log Sizes (per day)
Benchmark       w/ FSSHIM  w/o FSSHIM   w/ FSSHIM  w/o FSSHIM  SMP-ReVirt (from [16])
fmm             6.0        6.0          1.1 MB     2.0 MB      83.6 GB
lu              2.4        2.4          11.0 MB    13.0 MB     11.7 GB
ocean           3.0        3.0          1.5 MB     3.6 MB      28.1 GB
radix           4.5        4.5          0.8 MB     2.1 MB      88.7 GB
water           4.8        4.8          5.3 MB     83.2 MB     58.5 GB
pbzip2          2.9        4.0          5.7 MB     295.7 GB    -

Table 5. RECSHIM for Parallel Workloads (4 threads)

In summary, DPG overheads are reasonable for sev- 
eral applications, including some parallel applications 
and most reactive applications. Broadly, overhead tends 
to increase with sharing, especially as the number of 
threads grows. We did not attempt to optimize appli- 
cations for more “determinism friendly” sharing, which 
could improve performance. 


Microbenchmark. To more closely understand the overhead
of intercepting system calls with a shim, we wrote
a simple benchmark that does nothing but call getpid 
in a loop. We ran this benchmark both in a DPG without 
a shim, and in a DPG with a “null-shim.” The null-shim 
configuration ran 5x slower, suggesting that dOS im- 
poses an overhead of 5x on system call entry. 
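
A benchmark of this shape is easy to sketch. The version below is a hypothetical Python rendering, not the authors' program (which is not shown in the paper); it only measures the mean cost of `os.getpid()` on whatever system runs it, and the function name and iteration count are assumptions. Reproducing the 5x figure would require running the same loop once in a DPG without a shim and once with a null-shim attached.

```python
import os
import time


def time_getpid(iterations=200_000):
    """Return the mean wall-clock cost of one os.getpid() call, in seconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        os.getpid()  # a cheap call, so the loop is dominated by call entry
    return (time.perf_counter() - start) / iterations


if __name__ == "__main__":
    per_call = time_getpid()
    print(f"{per_call * 1e9:.0f} ns per getpid call")
```

Note that interpreter overhead is included in the Python measurement, so a compiled loop would give a tighter estimate of the system call cost alone.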


6.2 RECSHIM: Execution Recorder Shim 


We next evaluated the overhead of using RECSHIM, and 
its resulting log sizes. Table 5 characterizes RECSHIM 
for parallel workloads. Columns 2-3 show the overheads 
for RECSHIM with and without a deterministic file hier- 
archy, respectively. These overheads are essentially iden- 
tical to execution without RECSHIM (Table 3). Columns 
4-5 show log sizes for a full day of execution. REC- 
SHIM’s log sizes are very small because DPGs eliminate 
internal nondeterminism; the remaining nondeterminism 
is due to a few system calls such as gettimeofday. 
Not making filesystem accesses deterministic (Column 5)
increases the sources of nondeterminism, leading
to larger logs. This is especially true for pbzip2,
which must log the entire ISO image. These log sizes,
however, are still orders of magnitude lower than the
sizes reported by SMP-ReVirt [16] (Column 6). This
is because SMP-ReVirt needs to record internal nondeterminism
(again, especially shared-memory), which
can be massive. Since SMP-ReVirt is a hypervisor, it
logs nondeterminism internal to the OS, adding overhead
for radix that a process-level implementation of SMP-ReVirt
might be able to avoid. This is also an indication
that determinism enforcement at the hypervisor level is
likely to have a higher performance cost than when enforced
at the process level.

                    Config                  Throughput               Log Sizes
Benchmark           Num Proc  Threads/Proc  w/ FSSHIM   w/o FSSHIM   w/ FSSHIM
apache 10KB         16        1             1.7K req/s  1.6K req/s   48.6 B/req
apache 10KB         4         4             2.3K req/s  2.1K req/s   51.3 B/req
apache 10KB         1         16            2.2K req/s  2.2K req/s   50.4 B/req
chromium nytimes                            4.2s        3.9s         600 KB
chromium scripted                           40s         43s          3.3 MB

Table 6. RECSHIM for Reactive Workloads

                Throughput
Config          1 replica   2 replicas
Nondet          386 req/s   373 req/s
REPLICASHIM     369 req/s   372 req/s

Table 7. Replicated Execution Overheads

Table 6 shows overheads and log sizes for RECSHIM 
when running reactive applications. Columns 4-5 show 
the throughput while recording, both with and without 
FSSHIM enabled. With FSSHIM enabled, RECSHIM did 
not reduce the throughput of the webservers from the 
results shown in Table 2. However, disabling FSSHIM 
resulted in a small performance decrease. The decreased 
performance is due to the overhead of logging additional 
input. The overheads for Chromium are about the same 
as those seen in Table 2. Column 6 shows log sizes 
normalized to the number of requests for the Apache 
runs, as well as total log sizes for Chromium sessions. 


6.3 REPLICASHIM: Replicated Execution Shim


We end our evaluation by investigating whether we can
quickly build on DPGs to enable replication of an existing
multithreaded application. To answer this, we built
and tested REPLICASHIM, which replicates our modified 
nullhttpd. For a performance comparison, we also ran 
replicas outside DPGs but still using the same arbiter and 
replication protocol (Nondet). This configuration does 
not provide any deterministic guarantees. Table 7 shows 
the throughput for 1 and 2 replicas with 16 threads per
replica; each replica ran on a separate machine while the 
arbiter ran on a third machine. In both cases, the through- 
put is essentially matched. Note we did not spend much 
time optimizing the arbiter or its simplistic protocol, as 
REPLICASHIM is only a proof of concept; the arbiter
is the major bottleneck in these experiments. 


6.4 Summary 


Our evaluation illuminated the impact of determinism 
on application performance and scalability. Workload
is a fundamental factor: applications with frequent inter-thread
or inter-process sharing will encounter more overhead
and worse scalability when executed deterministically,
since this communication must be tracked and controlled.
Implementation choices also have a large impact.
We suspect that much of the overhead in dOS is not fun- 
damental and might be mitigated by using sharing-aware 
memory allocation, by fine-tuning integration with the 
Linux scheduler, or by using potential upcoming hard- 
ware support for transactional memory [2]. 

The choice of deterministic execution algorithm is 
another factor. Algorithms like DMP-O that provide a 
strict memory model or make heavy use of barriers will 
likely perform worse than those that loosen the memory
model or rely on alternative mechanisms such as
speculation. dOS could have implemented the DMP-TM
and DMP-B algorithms we developed in earlier work [6, 
14]. Both algorithms have better demonstrated scalabil- 
ity than DMP-O and can both be implemented at the ker- 
nel level, but both algorithms are more complex. The key 
idea of DMP-B is to relax the memory model by using a 
store buffer, which allows concurrent writes in the same 
quantum round and therefore improves scalability. The 
key idea of DMP-TM is to use transactional memory to 
speculate that each quantum round is conflict-free and 
thus can be executed completely in parallel.
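
To make the store-buffer idea concrete, the toy Python model below lets every thread in a round read the pre-round memory, buffer its own writes, and then commits the buffers in a fixed order. It is a sketch under our own assumptions, not the DMP-B algorithm from [6]: threads are plain functions, a quantum round is a single function call, and commit order is by thread id. All names are hypothetical.

```python
from collections import ChainMap


def run_quantum_round(memory, threads):
    """Run one toy quantum round and return the resulting memory.

    memory  -- dict of shared variables as of the start of the round
    threads -- dict mapping a thread id to a function that takes a
               dict-like view of memory
    """
    snapshot = dict(memory)
    buffers = {}
    for tid, body in threads.items():
        store_buffer = {}
        # Reads consult the thread's own buffer first, then the snapshot;
        # writes land only in the store buffer, never in the snapshot.
        body(ChainMap(store_buffer, snapshot))
        buffers[tid] = store_buffer
    # Commit phase: apply buffers in thread-id order so the outcome does
    # not depend on the order in which the threads happened to run.
    result = dict(snapshot)
    for tid in sorted(buffers):
        result.update(buffers[tid])
    return result


def writer_a(mem):
    mem["x"] = 1


def writer_b(mem):
    mem["y"] = mem["x"] + 10  # sees the pre-round x, not writer_a's store
    mem["x"] = 2


start = {"x": 0, "y": 0}
forward = run_quantum_round(start, {0: writer_a, 1: writer_b})
backward = run_quantum_round(start, {1: writer_b, 0: writer_a})
print(forward, forward == backward)
```

Both threads write `x` in the same round without serializing, and the two runs agree even though the threads execute in opposite orders. The price is a weaker memory model: `writer_b` does not observe `writer_a`'s store until the next round.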


7. Related Work 


Deterministic Execution. There are a few recent proposals
for removing internal nondeterminism in multithreaded
execution. DMP [14] is a hardware proposal
that includes two approaches for deterministic execution:
DMP-O uses ownership tracking at a cache-line granularity;
DMP-TM uses transactional memory [33] to further
reduce the cost of determinism by speculating that
there is no communication between threads. Kendo [29]
proposes a software-only library that provides a set of
deterministic synchronization operations that offer some
deterministic guarantees for race-free programs. CoreDet [6]
proposed DMP-B and used a compiler and runtime
system to provide determinism for arbitrary C/C++
programs. Grace [7] uses speculative execution to pro- 
vide determinism for fork-join parallel programs. These 
proposals all describe algorithms for execution-level de- 
terminism, as used by DPGs. Unlike these prior propos- 
als, however, DPGs support determinism beyond shared- 
memory in arbitrary binary programs and also provide a 
way to precisely control external nondeterminism. 
Another approach is language-level determinism, 
which uses a parallel language that is deterministic 
by construction, such as StreamIt [41], SHIM [17], 
NESL [10], Jade [35], or DPJ [11]. The prime trade-off 
between execution-level and language-level determinism 
is one of generality and controllability. In language-level 
determinism, the programmer must use specific language
constructs but gets explicit control of which deterministic
executions are possible; in execution-level determinism 
the programmer can use any language (i.e., determinism 
is fully transparent) but cannot control which determin- 
istic executions will happen, making behavior less pre- 
dictable at program construction time. While determinis- 
tic languages are a promising long-term solution, the ma- 
jority of today’s programs are written in mainstream lan- 
guages such as C++ or Java, and this will likely remain 
the case for the foreseeable future. Additionally, parallel 
languages are often domain-specific and not well suited 
to general purpose, reactive applications; in contrast, we 
have used dOS to demonstrate how reactive applications 
can benefit from execution-level determinism. 
Determinator [3] proposes to enforce determinism us- 
ing a custom microkernel. Like dOS, Determinator sup- 
ports multiple processes and uses page protection to en- 
force determinism of shared-memory accesses. Determi- 
nator supports both standard pthreads programs, via an 
implementation of DMP-B, as well as programs written 
using specialized parallel programming constructs that 
are designed to be deterministic. Unlike dOS, however, 
Determinator does not explore the separation between 
internal and external nondeterminism, and further, Deter- 
minator has no equivalent of the DPG shim layer inter- 
face for precisely controlling external nondeterminism. 


Record/Replay. Record and replay is a natural way to
cope with internal nondeterminism during debugging. 
There are many proposals for software-based implemen- 
tations of record and replay. Some record all shared 
accesses that lead to communication [22]; others as- 
sume uniprocessor execution and record only schedul- 
ing decisions [13]; others record only synchronization 
operations [36]. The high overheads of logging shared- 
memory communication motivated several proposals 
for hardware-supported recording [24, 45, 46], includ- 
ing some recent OS work on virtualization of hardware 
mechanisms for recording [25]. 

More recent work [1, 31, 47] relaxes the guarantees 
of replay by recording just a subset of the information 
required for faithful deterministic replay. The result is a 
smaller log at the cost of requiring a potentially impracti- 
cal search of the execution space during replay. ESD [48] 
uses symbolic execution to reconstruct thread schedules 
given only a core dump, without requiring any execution 
logs to begin with. Unfortunately, ESD suffers from the 
incompleteness problems faced by symbolic execution, 
and thus cannot guarantee that a suitable execution will 
be found during replay. 

Two recent and notable record/replay systems are 
SMP-ReVirt [16] and Scribe [21]. Both systems use page 
ownership similarly to dOS but record ownership transi- 
tions rather than imposing a single deterministic order, 
as in dOS. SMP-ReVirt is a hypervisor, and so it supports
full-system replay only, while Scribe is a kernel extension,
allowing it to support replay of process groups
much like dOS. Additionally, Scribe and dOS use similar 
strategies to track ownership changes of kernel objects. 

In contrast to all record/replay systems, the determin- 
ism guaranteed by DPGs enables precise replay without 
needing to record any internal nondeterminism. 


Replicated Execution. Most prior work in multithreaded
replicas has taken the approach of recording and replicating
internal nondeterminism. Examples include systems
that assume a uniprocessor [28, 40]; that assume
race-freedom [4, 5]; and that conservatively replicate 
all potential shared-memory nondeterminism [39]. Re- 
cently, Replicant [32] proposed a limited form of de- 
terministic execution specifically for the purpose of de- 
terministic replication, but this approach requires pro- 
grammer annotations. Most recently, Respec [23] exe- 
cutes replicas independently while periodically verifying 
consistency; when consistency is violated, replicas are 
rolled back to a consistent state and execution proceeds 
more conservatively. Respec does not support replication 
across more than one machine, limiting its usefulness. 
In contrast to prior systems, the determinism offered by 
DPGs naturally enables replication. 

There are some parallels between how dOS provides 
deterministic execution within a process group and how 
toolkits like Isis [9] and Horus [42] provide virtually syn- 
chronous execution to a distributed process group. Isis 
provides totally ordered multicast primitives that guaran- 
tee all processes see messages in the same order, a pow- 
erful building block for consistent updates of distributed 
replicas; dOS implements DMP-O to enforce a determin- 
istic order on both implicit shared-memory and explicit 
OS-channel communications between threads and pro- 
cesses. Unlike dOS, Isis does not guarantee the deter- 
ministic execution of a process or the deterministic tim- 
ing of message delivery relative to processes’ instruction 
sequence. Unlike Isis, dOS does not provide fault tol- 
erance, distributed group membership services, or state 
transfer to new group members. 


8. Conclusions 


We introduced the DPG abstraction, which allows pro- 
grammers to define a deterministic box inside which all 
communication happens deterministically. We described 
the shim layer, an interface through which external non- 
determinism can be observed and controlled by user- 
space programs. We developed dOS, an implementation 
of DPGs in Linux. We demonstrated the shim layer with 
three applications: record/replay, multithreaded replica- 
tion, and deterministic filesystem services. 

Our evaluation showed that DPGs have reasonable 
cost in reactive applications such as Apache and Chromium,
and also in several parallel workloads. This conceivably
enables deterministic execution in deployment,
which would fully leverage the benefits of determinism 
in testing, reliability and debugging.


Abstract


Deterministic execution offers many benefits for debug- 
ging, fault tolerance, and security. Current methods 
of executing parallel programs deterministically, how- 
ever, often incur high costs, allow misbehaved software 
to defeat repeatability, and transform time-dependent 
races into input- or path-dependent races without elim- 
inating them. We introduce a new parallel program- 
ming model addressing these issues, and use Determina- 
tor, a proof-of-concept OS, to demonstrate the model’s 
practicality. Determinator’s microkernel API provides 
only “shared-nothing” address spaces and determinis- 
tic interprocess communication primitives to make ex- 
ecution of all unprivileged code—well-behaved or not— 
precisely repeatable. Atop this microkernel, Determi- 
nator’s user-level runtime adapts optimistic replication 
techniques to offer a private workspace model for both 
thread-level and process-level parallel programming. This
model avoids the introduction of read/write data races, 
and converts write/write races into reliably-detected con- 
flicts. Coarse-grained parallel benchmarks perform and 
scale comparably to nondeterministic systems, on both 
multicore PCs and across nodes in a distributed cluster. 


1 Introduction 


We often wish to run software deterministically, so that 
from a given input it always produces the same out- 
put. Determinism is the foundation of replay debug- 
ging [37, 39, 46, 56], fault tolerance [15, 18,50], and ac- 
countability mechanisms [30, 31]. Methods of intrusion 
analysis [22, 34] and timing channel control [4] further 
assume the system can enforce determinism even on ma- 
licious code designed to evade analysis. Executing par- 
allel software deterministically is challenging, however, 
because threads sharing an address space—or processes 
sharing resources such as file systems—are prone to non- 
deterministic, timing-dependent races [3, 40, 42,43]. 
User-space techniques for parallel deterministic exe- 
cution [8, 10, 20, 21,44] show promise but have limi- 
tations. First, by relying on a deterministic scheduler 
residing in the application process, they permit buggy 
or malicious applications to compromise determinism 
by interfering with the scheduler. Second, determinis- 
tic schedulers emulate conventional APIs by synthesiz- 
ing a repeatable—but arbitrary—schedule of inter-thread 
interactions, often using an instruction counter as an arti- 
ficial time metric. Data races remain, therefore, but their 


manifestation depends subtly on inputs and code path 
lengths instead of on “real” time. Third, the user-level 
instrumentation required to isolate and schedule threads’ 
memory accesses can incur considerable overhead, even 
on coarse-grained code that synchronizes rarely. 


To meet the software development, debugging, and 
security challenges that ubiquitous parallelism presents, 
it may be insufficient to shoehorn the standard nonde- 
terministic programming model into a synthetic execu- 
tion schedule. Instead we propose to rethink the basic 
model itself. We would like a parallel environment that: 
(a) is “deterministic by default” [12,40], except when 
we inject nondeterminism explicitly via external inputs; 
(b) introduces no data races, either at the memory ac- 
cess level [25,43] or at higher semantic levels [3]; (c) 
can enforce determinism on arbitrary, compromised or 
malicious code for security reasons; and (d) is efficient 
enough to use for “normal-case” execution of deployed 
code, not just for instrumentation during development. 


As a step toward such a model, we present Determi- 
nator, a proof-of-concept OS designed around the above 
goals. Due to its OS-level approach, Determinator sup- 
ports existing languages, can enforce deterministic exe- 
cution not only on a single process but on groups of in- 
teracting processes, and can prevent malicious user-level 
code from subverting the kernel’s guarantee of determin- 
ism. In order to explore the design space freely, Determi- 
nator takes a “clean-slate” approach, making few com- 
promises for backward compatibility with existing ker- 
nels or APIs. Determinator’s programming model could 
be implemented in a legacy kernel for backward compat- 
ibility, however, as part of a “deterministic sandbox” for 
example [9]. Determinator’s user-level runtime also pro- 
vides limited emulation of the Unix process, thread, and 
file APIs, to simplify application porting. 


Determinator’s kernel enforces determinism by deny- 
ing user code direct access to hardware resources whose 
use can yield nondeterministic behavior, including real- 
time clocks, cycle counters, and writable shared memory. 
Determinator constrains user code to run within a hierar- 
chy of single-threaded, process-like spaces, each having 
a private virtual address space. The kernel’s low-level 
API provides only three system calls, with which a space 
can synchronize and communicate with its immediate 
parent and children. Potentially useful sources of non- 
determinism, such as timers, Determinator encapsulates 
into I/O devices, which unprivileged spaces can access
only via explicit communication with more privileged
spaces. A supervisory space can thus mediate all non- 
deterministic inputs affecting a subtree of unprivileged 
spaces, logging true nondeterministic events for future 
replay or synthesizing artificial events, for example. 

Atop this minimal kernel API, Determinator’s user- 
level runtime emulates familiar shared-resource pro- 
gramming abstractions. The runtime employs file repli- 
cation and versioning [47] to offer applications a logi- 
cally shared file system accessed via the Unix file API, 
and adapts distributed shared memory [2, 17] to emulate 
shared memory for multithreaded applications. Since 
this emulation is implemented in user space, applications 
can freely customize it, and runtime bugs cannot com- 
promise the kernel’s guarantee of determinism. 

Rather than strictly emulating a conventional, nondeterministic
API and consistency model as deterministic
schedulers do [8-10, 21, 44], Determinator explores
a novel private workspace model. In this model, each 
thread keeps a private virtual replica of all shared mem- 
ory and file system state; normal reads and writes access 
and modify this working copy. Threads reconcile their 
changes only at program-defined synchronization points, 
much as developers use version control systems. This 
model eliminates read/write data races, because reads see 
only causally prior writes in the explicit synchronization 
graph, and write/write races become conflicts, which the 
runtime reliably detects and reports independently of any 
(real or synthetic) execution schedule. 

Experiments with common parallel benchmarks sug- 
gest that Determinator can run coarse-grained paral- 
lel applications deterministically with both performance 
and scalability comparable to nondeterministic environ- 
ments. Determinism incurs a high cost on fine-grained 
parallel applications, however, due to Determinator’s use 
of virtual memory to isolate threads. For “embarrass- 
ingly parallel” applications requiring little inter-thread 
communication, Determinator can distribute the com- 
putation across nodes in a cluster mostly transparently 
to the application, maintaining usable performance and 
scalability. As a proof-of-concept, however, the cur- 
rent prototype has many limitations, such as a restric- 
tive space hierarchy, limited file system size, no persis- 
tent storage, and inefficient cross-node communication. 

This paper makes four main contributions. First,
we present the first OS designed from the ground 
up to offer system-enforced deterministic execution, 
for both multithreaded processes and groups of in- 
teracting processes. Second, we introduce a private 
workspace model for deterministic parallel program- 
ming, which eliminates read/write data races and con- 
verts schedule-dependent write/write races into reliably- 
detected, schedule-independent conflicts. Third, we use 
this model to emulate shared memory and file system abstractions
in Determinator’s user-space runtime. Fourth,
we demonstrate experimentally that this model is practical
and efficient enough for “normal-case” use, at least
for coarse-grained parallel applications. 

Section 2 outlines the deterministic programming 
model we seek to create. Section 3 then describes the 
Determinator kernel’s design and API, and Section 4 de- 
tails its user-space application runtime. Section 5 exam- 
ines our prototype implementation, and Section 6 evalu- 
ates it informally and experimentally. Finally, Section 7 
outlines related work, and Section 8 concludes. 


2 A Deterministic Programming Model 


Determinator’s basic goal is to offer a programming 
model that is naturally and pervasively deterministic. To 
be naturally deterministic, the model’s basic abstractions 
should avoid introducing data races or other nondeter- 
ministic behavior in the first place, and not merely pro- 
vide ways to control, detect, or reproduce races. To be 
pervasively deterministic, the model should behave de- 
terministically at all levels of abstraction: e.g., for shared 
memory access, inter-thread synchronization, file system 
access, inter-process communication, external device or 
network access, and thread/process scheduling. 

Intermediate design points are possible and may yield 
useful tradeoffs. Enforcing determinism only on syn- 
chronization and not on low-level memory access might 
improve efficiency, for example, as in Kendo [44]. For 
now, however, we explore whether a “purist” approach 
to pervasive determinism is feasible and practical. 

To achieve this goal, we must address timing depen- 
dencies in at least four aspects of current systems: in 
the way applications obtain semantically relevant nondeterministic
inputs they require for operation; in shared state
such as memory and file systems; in the synchroniza- 
tion APIs threads and processes use to coordinate; and 
in the namespaces with which applications use and man- 
age system resources. We make no claim that these are 
the only areas in which current operating systems intro- 
duce nondeterminism, but they are the aspects we found 
essential to address in order to build a working, perva- 
sively deterministic OS. We discuss each area in turn. 


2.1 Explicit Nondeterministic Inputs


Many applications use nondeterministic inputs, such as 
incoming messages for a web server, timers for an in- 
teractive or real-time application, and random numbers 
for a cryptographic algorithm. We seek not to eliminate 
application-relevant nondeterministic inputs, but to make 
such inputs explicit and controllable. 

Mechanisms for parallel debugging [39, 46, 56], fault 
tolerance [15, 18,50], accountability [30,31], and intru- 
sion analysis [22, 34] all rely on the ability to replay a 
computation instruction-for-instruction, in order to replicate,
verify, or analyze a program’s execution history.
Replay can be efficient when only I/O need be logged, 
as for a uniprocessor virtual machine [22], but becomes 
much more costly if internal sources of nondeterminism 
due to parallelism must also be replayed [19, 23]. 

Determinator therefore transforms useful sources of 
nondeterminism into explicit I/O, which applications 
may obtain via controllable channels, and eliminates 
only internal nondeterminism resulting from parallelism. 
If an application calls gettimeofday(), for example,
then a supervising process can intercept this I/O to log, 
replay, or synthesize these explicit time inputs. 
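
The following Python sketch illustrates the idea of routing time through an explicit, controllable channel. It is not Determinator's interface: the `TimeChannel` class, its modes, and the `clock` parameter are all hypothetical names chosen for illustration. In record mode the channel logs each value it hands out, in replay mode it returns the logged values in order, and a synthetic clock can be substituted for the real one.

```python
import time


class TimeChannel:
    """Hypothetical explicit channel through which a program reads the time."""

    def __init__(self, mode, log=None, clock=time.time):
        if mode not in ("record", "replay"):
            raise ValueError("mode must be 'record' or 'replay'")
        self.mode = mode
        self.log = [] if log is None else list(log)
        self._clock = clock  # real clock by default; may be synthetic
        self._next = 0       # position of the next value to replay

    def gettimeofday(self):
        if self.mode == "record":
            value = self._clock()
            self.log.append(value)  # the only nondeterminism we must save
            return value
        # Replay: return the logged inputs in their original order.
        value = self.log[self._next]
        self._next += 1
        return value


# Record a run that reads the real clock twice, then replay it exactly.
recorder = TimeChannel("record")
original = [recorder.gettimeofday(), recorder.gettimeofday()]
replayer = TimeChannel("replay", log=recorder.log)
replayed = [replayer.gettimeofday(), replayer.gettimeofday()]
print(original == replayed)
```

Because the time is the program's only nondeterministic input in this toy, the log holds everything needed to repeat the run; replaying past the end of the log raises `IndexError`, signalling that the replayed execution diverged from the recorded one.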


2.2 A Race-Free Model for Shared State 


Conventional systems give threads direct, concurrent ac- 
cess to many forms of shared state, such as shared mem- 
ory and file systems, yielding data races and heisenbugs 
if the threads fail to synchronize properly [25, 40, 43]. 
While replay debuggers [37, 39,46, 56] and deterministic 
schedulers [8, 10,20,21,44] make data races reproducible 
once they manifest, they do not change the inherently 
race-prone model in which developers write applications. 

Determinator replaces the standard concurrent access 
model with a private workspace model, in which data 
races do not arise in the first place. This model gives 
each thread a complete, private virtual replica of all log- 
ically shared state a thread may access, including shared 
memory and file system state. A thread’s normal reads 
and writes affect only its private working state, and do 
not interact directly with other threads. Instead, Determinator
accumulates each thread’s changes to shared
state, then reconciles these changes among threads only 
at program-defined synchronization points. This model 
is related to and inspired by early parallel Fortran sys- 
tems [7,51], replicated file systems [47], transactional 
memory [33,52] and operating systems [48], and dis- 
tributed version control systems [29], but to our knowl- 
edge Determinator is the first OS to introduce a model 
for pervasive thread- and process-level determinism. 

If one thread executes the assignment ‘x = y’ while
another concurrently executes ‘y = x’, for example,
these assignments race in the conventional model, but are
race-free under Determinator and always swap x with y.
Each thread’s read of x or y always sees the “old” version 
of that variable, saved in the thread’s private workspace 
at the last explicit synchronization point. 
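
The swap can be reproduced with a minimal sketch in C. It models each thread's private workspace as a copy of the state at the last synchronization point, and reconciles by copying back only the fields a thread changed. The struct and function names are hypothetical, and this field-level reconciliation only stands in for Determinator's actual mechanism.

```c
#include <assert.h>

/* Hypothetical shared state: the two variables from the example. */
struct shared { int x, y; };

/* Copy into the parent only the fields the thread changed relative to
 * the snapshot taken at the last synchronization point. */
static void reconcile(struct shared *parent, const struct shared *snap,
                      const struct shared *priv)
{
    if (priv->x != snap->x) parent->x = priv->x;
    if (priv->y != snap->y) parent->y = priv->y;
}

/* Run "x = y" and "y = x" in two private workspaces, then reconcile. */
static struct shared swap_in_workspaces(struct shared start)
{
    struct shared t1 = start;   /* thread 1's private replica */
    struct shared t2 = start;   /* thread 2's private replica */
    struct shared result = start;

    t1.x = t1.y;                /* thread 1 reads the old y */
    t2.y = t2.x;                /* thread 2 reads the old x */

    reconcile(&result, &start, &t1);
    reconcile(&result, &start, &t2);

    /* The outcome does not depend on which thread is reconciled first. */
    assert(result.x == start.y && result.y == start.x);
    return result;
}
```

Because each thread writes a different variable, the two reconciliation steps commute, so the result is the same whichever thread is reconciled first.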

Figure 1 illustrates a more realistic example of a game 
or simulator, which uses an array of “actors” (players, 
particles, etc.) to represent some logical “universe,” and 
updates all of the actors in parallel at each time step. To 
update the actors, the main thread forks a child thread to 
process each actor, then synchronizes by joining all these 
child threads. The child thread code to update each ac- 
tor is shown “inline” within the main() function, which 


struct actor_state actor[nactors]; 


main() 
    initialize all elements of actor[] array 
    for (time = 0; ; time++) 
        for (i = 0; i < nactors; i++) 
            if (thread_fork(i) == IN_CHILD) 
                // child thread to process actor[i] 
                examine state of nearby actors 
                update state of actor[i] accordingly 
                thread_exit(); 
        for (i = 0; i < nactors; i++) 
            thread_join(i); 


Figure 1: C pseudocode for lock-step time simulation, 
which contains a data race in standard concurrency mod- 
els but is bug-free under Determinator. 


under Unix works only with process-level fork(); De- 
terminator offers this convenience for shared memory 
threads as well, as discussed later in Section 4.4. 

In this example, each child thread reads the “prior” 
state of any or all actors in the array, then updates the 
state of its assigned actor “in-place,” without any explicit 
copying or additional synchronization. With standard 
threads this code has a read/write race: each child thread 
may see an arbitrary mix of “old” and “new” states as 
it examines other actors in the array. Under Determi- 
nator, however, this code is correct and race-free. Each 
child thread reads only its private working copy of the 
actors array, which is untouched (except by the child 
thread itself) since the main thread forked that child. As 
the main thread rejoins all its child threads, Determina- 
tor merges each child’s actor array updates back into the 
main thread’s working copy, for use in the next time step. 

While read/write races disappear in Determinator’s 
model, traditional write/write races become conflicts. If 
two child threads concurrently write to the same actor 
array element, for example, Determinator detects this 
conflict and signals a runtime exception when the main 
thread attempts to join the second conflicting child. In 
the conventional model, by contrast, the threads’ execu- 
tion schedules might cause either of the two writes to 
“win” and silently propagate its likely erroneous value 
throughout the computation. Running this code under 
a conventional deterministic scheduler causes the “win- 
ner” to be decided based on a synthetic, reproducible 
time metric (e.g., instruction count) rather than real time, 
but the race remains and may still manifest or vanish due 
to slight changes in inputs or instruction path lengths. 


2.3 A Race-Free Synchronization API 


Conventional threads can still behave nondeterministi- 
cally even in a correctly locked program with no low- 
level data races. Two threads might acquire a lock in any 
order, for example, leading to high-level data races [3]. 
This source of nondeterminism is inherent in the lock ab- 
straction: we can record and replay or synthesize a lock 
acquisition schedule [44], but such a schedule is still ar- 
bitrary and effectively unpredictable to the developer. 
Fortunately, many other synchronization abstractions 
are naturally deterministic, such as fork/join, barriers, 
and futures [32]. Deterministic abstractions have the key 
property that when threads synchronize, program logic 
alone determines at what points in the threads’ execu- 
tion paths the synchronization occurs, and which threads 
are involved. In fork/join synchronization, for exam- 
ple, the parent’s thread_join(t) operation and the child’s 
thread_exit() determine the respective synchronization 
points, and the parent indicates explicitly the thread t to 
join. Locks fail this test because one thread’s unlock() 
passes the lock to an arbitrary successor thread’s lock(). 
Queue abstractions such as semaphores and pipes are de- 
terministic if only one thread can access each end of the 
queue [24, 36], but nondeterministic if several threads 
can race to insert or remove elements at either end. A 
related draft elaborates on these considerations [5]. 
Since the multicore revolution is young and most ap- 
plication code is yet to be parallelized, we may still have 
a choice of what synchronization abstractions to use. 
Determinator therefore supports only race-free synchro- 
nization primitives natively, although it can emulate non- 
deterministic primitives via deterministic scheduling for 
compatibility, as described later in Section 4.5. 


2.4 Race-Free System Namespaces 


Current operating system APIs often introduce nondeter- 
minism unintentionally by exposing shared namespaces 
implicitly synchronized by locks. Execution timing af- 
fects the pointers returned by malloc() or mmap() 
or the file numbers returned by open() in multi- 
threaded Unix processes, and the process IDs returned 
by fork() or the file names returned by mktemp() in 
single-threaded processes. Even if only one thread actu- 
ally uses a given memory block, file, process ID, or tem- 
porary file, the assignment of these names from a shared 
namespace is inherently nondeterministic. 

Determinator’s API therefore avoids creating shared 
namespaces with system-chosen names, instead favor- 
ing thread-private namespaces with application-chosen 
names. Application code, not the system, decides where 
to allocate memory and what process IDs to assign chil- 
dren. This principle ensures that naming a resource re- 
veals no shared state information other than what the ap- 
plication itself provided. Since implicitly shared names- 
paces often cause multiprocessor contention, designing 
system APIs to avoid this implicit sharing may be syner- 
gistic with recent multicore scalability work [14]. 


Figure 2: The kernel’s hierarchy of spaces, each contain- 
ing private register and virtual memory state. 


3 The Determinator Kernel 


Having outlined the principles underlying Determina- 
tor’s programming model, we now describe its kernel 
design. Normal applications do not use the kernel API 
directly, but rather the higher-level abstractions provided 
by the user-level runtime, described in the next section. 
We make no claim that our kernel design or API is the 
“right” design for a determinism-enforcing kernel, but 
merely that it illustrates one way to implement a perva- 
sively deterministic application environment. 


3.1 Spaces 


Determinator executes application code within an arbi- 
trarily deep hierarchy of spaces, illustrated in Figure 2. 
Each space consists of CPU register state for a single 
control flow, and private virtual memory containing code 
and data directly accessible within that space. A De- 
terminator space is analogous to a single-threaded Unix 
process, with important differences; we use the term 
“space” to highlight these differences and avoid confu- 
sion with the “process” and “thread” abstractions Deter- 
minator emulates at user level, described in Section 4. 


As in a nested process model [27], a Determinator 
space cannot outlive its parent, and a space can directly 
interact only with its immediate parent and children via 
three system calls described below. The kernel provides 
no file systems, writable shared memory, or other ab- 
stractions that imply globally shared state. 


Only the distinguished root space has direct access to 
nondeterministic inputs via I/O devices, such as console 
input or clocks. Other spaces can access I/O devices only 
indirectly via parent/child interactions, or via I/O privi- 
leges delegated by the root space. A parent space can 
thus control all nondeterministic inputs into any unpriv- 
ileged space subtree, e.g., logging inputs for future re- 
play. This space hierarchy also creates a performance 
bottleneck for I/O-bound applications, a limitation of the 
current design we intend to address in future work. 


Call | Interacts with | Description 
Put  | Child space    | Copy register state and/or virtual memory range into child, and optionally start child executing. 
Get  | Child space    | Copy register state, virtual memory range, and/or changes since the last snapshot out of a child. 
Ret  | Parent space   | Stop and wait for parent to issue a Get or Put. Processor traps also cause implicit Ret. 


Table 1: System calls comprising Determinator’s kernel API. 


Put | Get | Option | Description 
✓   | ✓   | Regs   | PUT/GET child’s register state. 
✓   | ✓   | Copy   | Copy memory to/from child. 
✓   | ✓   | Zero   | Zero-fill virtual memory range. 
✓   |     | Snap   | Snapshot child’s virtual memory. 
✓   |     | Start  | Start child space executing. 
    | ✓   | Merge  | Merge child’s changes into parent. 
✓   | ✓   | Perm   | Set memory access permissions. 
✓   | ✓   | Tree   | Copy (grand)child subtree. 


Table 2: Options/arguments to the Put and Get calls. 


3.2 System Call API 


Determinator spaces interact only as a result of proces- 
sor traps and the kernel’s three system calls—Put, Get, 
and Ret, summarized in Table 1. Put and Get take sev- 
eral optional arguments, summarized in Table 2. Most 
options can be combined: e.g., in one Put call a space 
can initialize a child’s registers, copy a range of the par- 
ent’s virtual memory into the child, set page permissions 
on the destination range, save a complete snapshot of the 
child’s address space, and start the child executing. 

Each space has a private namespace of child spaces, 
which user-level code manages. A space specifies a 
child number to Get or Put, and the kernel creates that 
child if it doesn’t already exist, before performing the re- 
quested operations. If the specified child did exist and 
was still executing at the time of the Put/Get call, the 
kernel blocks the parent’s execution until the child stops 
due to a Ret system call or a processor trap. These “ren- 
dezvous” semantics ensure that spaces synchronize only 
at well-defined points in both spaces’ execution. 

The Copy option logically copies a range of virtual 
memory between the invoking space and the specified 
child. The kernel uses copy-on-write to optimize large 
copies and avoid physically copying read-only pages. 

Merge is available only on Get calls. A Merge is like a 
Copy, except the kernel copies only bytes that differ be- 
tween the child’s current and reference snapshots into the 
parent space, leaving other bytes in the parent untouched. 
The kernel also detects conflicts: if a byte changed in 
both the child’s and parent’s spaces since the snapshot, 
the kernel generates an exception, treating a conflict as 
a programming error like an illegal memory access or 
divide-by-zero. Determinator’s user-level runtime uses 
Merge to give multithreaded processes the illusion of 
shared memory, as described later in Section 4.4. In prin- 
ciple, user-level code could implement Merge itself, but 
the kernel’s direct access to page tables makes it easy for 
the kernel to implement Merge efficiently. 
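
The Merge rule can be written down as a minimal sketch in C, assuming flat byte buffers for the parent, the child, and the child's reference snapshot. The function name and the error-code convention are hypothetical; the kernel reports a conflict as an exception, and the sketch makes no attempt to model its use of page tables.

```c
#include <assert.h>
#include <stddef.h>

/* Merge a child's changes into the parent at byte granularity.
 *   snap:   reference snapshot taken when the child was started
 *   child:  the child's current memory
 *   parent: the parent's current memory, updated in place
 * Returns 0 on success, or -1 (leaving parent untouched) if any byte
 * changed in both parent and child since the snapshot. */
static int merge_bytes(unsigned char *parent, const unsigned char *child,
                       const unsigned char *snap, size_t len)
{
    assert(parent && child && snap);

    /* Pass 1: detect conflicts before modifying anything. */
    for (size_t i = 0; i < len; i++)
        if (child[i] != snap[i] && parent[i] != snap[i])
            return -1;

    /* Pass 2: copy only the bytes the child changed. */
    for (size_t i = 0; i < len; i++)
        if (child[i] != snap[i])
            parent[i] = child[i];

    return 0;
}
```

A byte that the child rewrites with its snapshot value is indistinguishable from an unmodified byte under this comparison, which matches the description of Merge as copying only bytes that differ.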

Finally, the Ret system call stops the calling space, re- 
turning control to the space’s parent. Exceptions such as 
divide-by-zero also cause a Ret, providing the parent a 
status code indicating why the child stopped. 

To facilitate debugging and prevent untrusted children 
from looping forever, a parent can start a child with an 
instruction limit, forcing control back to the parent af- 
ter the child and its descendants collectively execute this 
many instructions. Counting instructions instead of “real 
time” preserves determinism, while enabling spaces to 
“quantize” a child’s execution to implement scheduling 
schemes deterministically at user level [8,21]. 

Barring kernel or processor bugs, unprivileged spaces 
constrained to use the above kernel API alone cannot 
behave nondeterministically even by deliberate design. 
While a formal proof is out of scope, one straightforward 
argument is that the above Get/Put/Ret primitives reduce 
to blocking, one-to-one message channels, making the 
space hierarchy a deterministic Kahn network [36]. 


3.3 Distribution via Space Migration 


The kernel allows space hierarchies to span not only 
multiple CPUs in a multiprocessor/multicore system, but 
also multiple nodes in a homogeneous cluster, mostly 
transparently to application code. While distribution is 
semantically transparent to applications, an application 
may have to be designed with distribution in mind to per- 
form well. As with other aspects of the kernel’s design, 
we make no pretense that this is the “right” approach to 
cross-node distribution, but merely one way to extend a 
deterministic execution model across a cluster. 
Distribution adds no new system calls or options to 
the API above. Instead, the Determinator kernel inter- 
prets the higher-order bits in each process’s child num- 
ber namespace as a “node number” field. When a space 
invokes Put or Get, the kernel first logically migrates the 
calling space’s state and control flow to the node whose 
number the user specifies as part of its child number 
argument, before creating and/or interacting with some 
child on that node, as specified in the remaining child 
number bits. Figure 3 illustrates a space migrating be- 
tween two nodes and managing child spaces on each. 

Once created, a space has a home node, to which the 
space migrates when interacting with its parent on a Ret 
or trap. Nodes are numbered so that “node zero” in 
any space’s child namespace always refers to the space’s 
home node. If a space uses only the low bits in its 
child numbers and leaves the node number field zero, the 
space’s children all have the same home as the parent. 
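
The child number encoding can be illustrated with a minimal sketch in C. The text does not give field widths, so the 32-bit child number and 8-bit node field below are assumptions, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout (not specified in the text): 32-bit child numbers whose
 * top 8 bits hold the node number field. */
#define NODE_BITS  8u
#define CHILD_BITS (32u - NODE_BITS)
#define CHILD_MASK ((UINT32_C(1) << CHILD_BITS) - 1u)

/* Build a child number naming child `child` on node `node`. */
static uint32_t make_child_num(uint32_t node, uint32_t child)
{
    assert(node < (UINT32_C(1) << NODE_BITS));
    assert(child <= CHILD_MASK);
    return (node << CHILD_BITS) | child;
}

/* Node number field: zero always means the caller's home node. */
static uint32_t node_of(uint32_t num)  { return num >> CHILD_BITS; }

/* Remaining low bits: the child's index on that node. */
static uint32_t child_of(uint32_t num) { return num & CHILD_MASK; }
```

With this layout, code that uses only small child numbers leaves the node field zero and therefore never migrates away from its home node.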

When the kernel migrates a space, it first transfers to 
the receiving kernel only the space’s register state and 
address space summary information. Next, the receiving 
kernel requests the space’s memory pages on demand as 
the space accesses them on the new node. Each node’s 
kernel avoids redundant cross-node page copying in the 
common case when a space repeatedly migrates among 
several nodes—e.g., when a space starts children on each 
of several nodes, then returns later to collect their results. 
For pages that the migrating space only reads and never 
writes, such as program code, each kernel reuses cached 
copies of these pages whenever the space returns to that 
node. The kernel currently performs no prefetching or 
other adaptive optimizations. Its rudimentary messaging 
protocol runs directly atop Ethernet, and does not support 
TCP/IP for Internet-wide distribution. 


4 Emulating High-Level Abstractions 


Determinator’s kernel API eliminates many convenient 
and familiar abstractions; can we reproduce them un- 
der strict determinism? We find that many familiar ab- 
stractions remain feasible, though with important trade- 
offs. This section describes how Determinator’s user- 
level runtime infrastructure emulates traditional Unix 
processes, file systems, threads, and synchronization. 


4.1 Processes and fork/exec/wait 


We make no attempt to replicate Unix process se- 
mantics exactly, but would like to emulate traditional 
fork/exec/wait APIs enough to support common 
uses in scriptable shells, build tools, and multi-process 
“batch processing” applications such as compilers. 


Fork: Implementing a basic Unix fork() requires 
only one Put system call, to copy the parent’s entire 
memory state into a child space, set up the child’s regis- 
ters, and start the child. The difficulty arises from Unix’s 
global process ID (PID) namespace, a source of nonde- 
terminism as discussed in Section 2.4. Since most ap- 
plications use PIDs returned by fork() merely as an 
opaque argument to a subsequent waitpid(), our run- 
time makes PIDs local to each process: one process’s 
PIDs are unrelated to, and may numerically conflict with, 
PIDs in other processes. This change breaks Unix appli- 
cations that pass PIDs among processes, and means that 
commands like ‘ps’ must be built into shells for the same 
reason that ‘cd’ already is. This simple approach works 
for compute-oriented applications following the typical 
fork/wait pattern, however. 

Since fork() returns a PID chosen by the system, 
while our kernel API requires user code to manage child 
numbers, our user-level runtime maintains a “free list” of 
child spaces and reserves one during each fork(). To 
emulate Unix process semantics more closely, a central 
space such as the root space could manage a global PID 
namespace, at the cost of requiring inter-space commu- 
nication during operations such as fork(). 
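
A minimal sketch in C of process-local PID allocation follows. It uses a small in-use table scanned from the lowest slot rather than a linked free list, and the table size, struct, and function names are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CHILDREN 8          /* hypothetical limit for the sketch */

/* Per-process table of child numbers. Slot 0 is never handed out, so
 * that fork() can still return 0 in the child. */
struct proc {
    bool in_use[MAX_CHILDREN];
};

/* Reserve the lowest free child number and return it as the local PID,
 * or -1 if none is free. The choice depends only on this process's own
 * fork/wait history, never on timing or on other processes. */
static int pid_alloc(struct proc *p)
{
    assert(p);
    for (int i = 1; i < MAX_CHILDREN; i++) {
        if (!p->in_use[i]) {
            p->in_use[i] = true;
            return i;
        }
    }
    return -1;
}

/* Release a PID once the child's status has been collected. */
static void pid_free(struct proc *p, int pid)
{
    assert(p && pid > 0 && pid < MAX_CHILDREN && p->in_use[pid]);
    p->in_use[pid] = false;
}
```

Two processes using this scheme independently hand out the same small integers, which is exactly the numerical overlap between processes that the text describes.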


Exec: A user-level implementation of Unix exec() 
must construct the new program’s memory image, in- 
tended to replace the old program, while still execut- 
ing the old program’s runtime library code. Our run- 
time loads the new program into a “reserved” child space 
never used by fork(), then calls Get to copy that 
child’s entire memory atop that of the (running) parent: 
this Get thus “returns” into the new program. To ensure 
that the instruction address following the old program’s 
Get is a valid place to start the new program, the run- 
time places this Get in a small “trampoline” code frag- 
ment mapped at the same location in the old and new 
programs. The runtime also carries over some Unix pro- 
cess state, such as the PID namespace and file system 
state described later, from the old to the new program. 


Wait: When an application calls waitpid() to wait 
for a specific child, the runtime calls Get to synchronize 
with the child’s Ret and obtain the child’s exit status. The 
child may return to the parent before terminating, in or- 
der to make I/O requests as described below; in this case, 
the parent’s runtime services the I/O request and resumes 
the waitpid() transparently to the application. 

Unix’s wait() is more challenging, as it waits for 
any (i.e., “the first”) child to terminate, violating the 
constraints of deterministic synchronization discussed in 
Section 2.3. Our kernel’s API provides no system call to 
“wait for any child,” and can’t (for unprivileged spaces) 
without compromising determinism. Instead, our run- 
time waits for the child that was forked earliest whose 
status was not yet collected. 
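
The deterministic wait() policy amounts to a first-in, first-out queue over fork order, as in the following minimal sketch in C. The queue structure and function names are hypothetical, and the sketch omits the blocking Get on the chosen child as well as out-of-order collection by waitpid().

```c
#include <assert.h>

#define MAX_KIDS 16             /* hypothetical limit for the sketch */

/* Children of one process, recorded in the order they were forked. */
struct waitq {
    int pid[MAX_KIDS];
    int head;                   /* next child to collect */
    int tail;                   /* next free slot */
};

/* Record a newly forked child. */
static void on_fork(struct waitq *q, int pid)
{
    assert(q && q->tail < MAX_KIDS);
    q->pid[q->tail++] = pid;
}

/* Deterministic wait(): ignore completion order and choose the child
 * that was forked earliest and not yet collected. Returns -1 if no
 * children remain. A real runtime would now block until that child
 * stops. */
static int wait_any(struct waitq *q)
{
    assert(q);
    if (q->head == q->tail)
        return -1;
    return q->pid[q->head++];
}
```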

This behavior does not affect applications that fork one 
or more children and then wait for all of them to com- 
plete, but affects two common uses of wait(). First, 
interactive Unix shells use wait() to report when back- 
ground processes complete; thus, an interactive shell run- 
ning under Determinator requires special “nondetermin- 
istic I/O privileges” to provide this functionality (and re- 
lated functions such as interactive job control). Second, 
our runtime’s behavior may adversely affect the perfor- 
mance of programs that use wait() to implement dy- 
namic scheduling or load balancing in user space. 


[Figure 4 panels: (a) ‘make -j’ on Unix; (b) ‘make -j’ on Determinator; (c) ‘make -j2’ on Unix (nondeterministic); (d) ‘make -j2’ on Determinator (deterministic). Each panel plots Task 1 on CPU 1 and Tasks 2 and 3 on CPU 2 against time, marking when wait() returns.] 


Figure 4: Example parallel make scheduling scenarios 
under Unix versus Determinator: (a) and (b) with unlim- 
ited parallelism (no user-level scheduling); (c) and (d) 
with a “2-worker” quota imposed at user level. 

Consider a parallel make run with or without limiting 
the number of concurrent children. A plain ‘make -j’, 
allowing unlimited children, leaves scheduling decisions 
to the system. Under Unix or Determinator, the kernel’s 
scheduler dynamically assigns tasks to available CPUs, 
as illustrated in Figure 4 (a) and (b). If the user runs 
‘make -j2’, however, then make initially starts only 
tasks 1 and 2, then waits for one of them to complete be- 
fore starting task 3. Under Unix, wait() returns when 
the short task 2 completes, enabling make to start task 3 
immediately as in (c). On Determinator, however, the 
wait() returns only when (deterministically chosen) 
task 1 completes, resulting in a non-optimal schedule (d): 
determinism prevents the runtime from learning which of 
tasks 1 and 2 completed first. The unavailability of tim- 
ing information with which to make good application- 
level scheduling decisions thus suggests a practice of 
leaving scheduling to the system in a deterministic en- 
vironment (e.g., ‘make -j’ instead of ‘-j2’). 
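
The cost of the deterministic policy can be estimated with a small simulation, sketched here in C. The task durations used in the usage note (4, 1, and 2 time units) are assumptions chosen to match the long-task-1, short-task-2 scenario, not measurements, and the function name is hypothetical.

```c
#include <assert.h>

/* Completion time of n tasks run under a quota of `jobs` concurrent
 * children, with unlimited CPUs.
 *   det != 0: wait() returns the earliest-forked uncollected child.
 *   det == 0: wait() returns whichever running child finishes first. */
static int makespan(const int *dur, int n, int jobs, int det)
{
    int finish[16];
    int collected[16] = {0};
    int now = 0, started = 0, running = 0, end = 0;

    assert(n <= 16 && jobs >= 1);

    while (started < n || running > 0) {
        if (started < n && running < jobs) {
            /* Quota not reached: fork the next task immediately. */
            finish[started] = now + dur[started];
            if (finish[started] > end)
                end = finish[started];
            started++;
            running++;
            continue;
        }
        /* Quota reached, or nothing left to start: wait for one child. */
        int pick = -1;
        for (int i = 0; i < started; i++) {
            if (collected[i])
                continue;
            if (pick < 0 || (!det && finish[i] < finish[pick]))
                pick = i;
        }
        collected[pick] = 1;
        running--;
        if (finish[pick] > now)
            now = finish[pick];
    }
    return end;
}
```

With durations {4, 1, 2} and a quota of two, the Unix-style policy finishes at time 4 and the deterministic policy at time 6, while an effectively unlimited quota gives 4 under either policy.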


4.2 A Shared File System 


Unix’s globally shared file system provides a convenient 
namespace and repository for staging program inputs, 
storing outputs, and holding intermediate results such as 
temporary files. Since our kernel permits no physical 
state sharing, user-level code must emulate shared state 
abstractions. Determinator’s “shared-nothing” space hi- 
erarchy is similar to a distributed system consisting only 
of uniprocessor machines, so our user-level runtime bor- 
rows distributed file system principles to offer applica- 
tions a shared file system abstraction. 


Figure 5: Each user-level runtime maintains a private 
replica of a logically shared file system, using file ver- 
sioning to reconcile replicas at synchronization points. 


Since our current focus is on emulating familiar ab- 
stractions and not on developing storage systems, Deter- 
minator’s file system currently provides no persistence: 
it effectively serves only as a temporary file system. 


While many distributed file system designs may be ap- 
plicable, our runtime uses replication with weak consis- 
tency [53,55]. Our runtime maintains a complete file 
system replica in the address space of each process it 
manages, as shown in Figure 5. When a process cre- 
ates a child via fork(), the child inherits a copy of 
the parent’s file system in addition to the parent’s open 
file descriptors. Individual open/close/read/write 
operations in a process use only that process’s file sys- 
tem replica, so different processes’ replicas may diverge 
as they modify files concurrently. When a child termi- 
nates and its parent collects its state via wait(), the 
parent’s runtime copies the child’s file system image into 
a scratch area in the parent space and uses file version- 
ing [47] to propagate the child’s changes into the parent. 

If a shell or parallel make forks several compiler pro- 
cesses in parallel, for example, each child writes its out- 
put .o file to its own file system replica, then the par- 
ent’s runtime merges the resulting .o files into the par- 
ent’s file system as the parent collects each child’s exit 
status. This copying and reconciliation is not as ineffi- 
cient as it may appear, due to the kernel’s copy-on-write 
optimizations. Replicating a file system image among 
many spaces copies no physical pages until user-level 
code modifies them, so all processes’ copies of identical 
files consume only one set of pages. 

As in any weakly-consistent file system, processes 
may cause conflicts if they perform unsynchronized, con- 
current writes to the same file. When our runtime detects 
a conflict, it simply discards one copy and sets a con- 
flict flag on the file; subsequent attempts to open() the 
file result in errors. This behavior is intended for batch 
compute applications for which conflicts indicate an ap- 
plication or build system bug, whose appropriate solu- 
tion is to fix the bug and re-run the job. Interactive use 
would demand a conflict handling policy that avoids los- 
ing data. The user-level runtime could alternatively use 
pessimistic locking to implement stronger consistency 
and avoid unsynchronized concurrent writes, at the cost 
of more inter-space communication. 
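
A minimal sketch in C of version-based reconciliation with a conflict flag follows. The per-file metadata layout, the use of a single integer version, and the choice to keep the parent's copy on conflict are assumptions; the text says only that one copy is discarded and the file is flagged.

```c
#include <assert.h>

/* Hypothetical per-file metadata kept in each replica. */
struct file {
    int  version;    /* bumped on every modification */
    int  conflict;   /* set when concurrent writes were detected */
    char data[16];   /* stand-in for the file contents */
};

/* Propagate a child's change to one file into the parent's replica.
 * `base` is the version both replicas held when the child was forked. */
static void reconcile_file(struct file *parent, const struct file *child,
                           int base)
{
    assert(parent->version >= base && child->version >= base);

    if (child->version == base)
        return;                 /* child never wrote the file */
    if (parent->version != base) {
        parent->conflict = 1;   /* both wrote: keep parent's copy, flag it */
        return;
    }
    *parent = *child;           /* only the child wrote: take its copy */
}
```

An open() wrapper in such a runtime would check the conflict flag and fail, which gives the batch-oriented behavior described above.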

The current design’s placement of each process’s file 
system replica in the process’s own address space has 
two drawbacks. First, it limits total file system size to 
less than the size of an address space; this is a serious 
limitation in our 32-bit prototype, though it may be less 
of an issue on a 64-bit architecture. Second, wild pointer 
writes in a buggy process may corrupt the file system 
more easily than in Unix, where a buggy process must 
actually call write() to corrupt a file. The runtime 
could address the second issue by write-protecting the 
file system area between calls to write(), or it could 
address both issues by storing file system data in child 
spaces not used for executing child processes. 


4.3 Input/Output and Logging 


Since unprivileged spaces can access external I/O de- 
vices only indirectly via parent/child interaction within 
the space hierarchy, our user-level runtime treats I/O as 
a special case of file system synchronization. In addition 
to regular files, a process’s file system image can contain 
special I/O files, such as a console input file and a console 
output file. Unlike Unix device special files, Determina- 
tor’s I/O files actually hold data in the process’s file sys- 
tem image: for example, a process’s console input file 
accumulates all the characters the process has received 
from the console, and its console output file contains all 
the characters it has written to the console. In the current 
prototype this means that console or log files can even- 
tually “fill up” and become unusable, though a suitable 
garbage-collection mechanism could address this flaw. 

When a process does a read() from the console, 
the C library first returns unread data already in the pro- 
cess’s local console input file. When no more data is 
available, instead of returning an end-of-file condition, 
the process calls Ret to synchronize with its parent and 
wait for more console input (or in principle any other 
form of new input) to become available. When the par- 
ent does a wait() or otherwise synchronizes with the 
child, it propagates any new input it already has to the 
child. When the parent has no new input for any waiting 
children, it forwards all their input requests to its parent, 
and ultimately to the kernel via the root process. 

When a process does a console write(), the run- 
time appends the new data to its internal console output 
file as it would append to a regular file. The next time the 
process synchronizes with its parent, file system recon- 
ciliation propagates these writes toward the root process, 
which forwards them to the kernel’s I/O devices. A pro- 
cess can request immediate synchronization and output 
propagation by explicitly calling fsync(). 

Figure 6: A multithreaded process built from one space 
per thread, with a master space managing synchroniza- 
tion and memory reconciliation. 


The reconciliation mechanism handles “append-only” 
writes differently from other file changes, enabling con- 
current writes to console or log files without conflict. 
During reconciliation, if both the parent and child have 
made append-only writes to the same file, reconciliation 
appends the child’s latest writes to the parent’s copy of 
the file, and vice versa. Each process’s replica thus ac- 
cumulates all processes’ concurrent writes, though dif- 
ferent processes may observe these writes in a different 
order. Unlike Unix, rerunning a parallel computation 
from the same inputs with and without output redirection 
yields byte-for-byte identical console and log file output. 
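
Append-only reconciliation can be sketched in C as an exchange of the bytes each side appended after their common prefix. The fixed-size buffer, the struct, and the function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOG_MAX 64              /* hypothetical capacity for the sketch */

/* One replica of an append-only console or log file. */
struct logfile {
    char   buf[LOG_MAX];
    size_t len;
};

/* Append a string to one replica. */
static void log_append(struct logfile *f, const char *s)
{
    size_t n = strlen(s);
    assert(f->len + n <= LOG_MAX);
    memcpy(f->buf + f->len, s, n);
    f->len += n;
}

/* Exchange the bytes each side appended after the common prefix of
 * length `base`, which is the file length at the last synchronization. */
static void reconcile_append(struct logfile *parent, struct logfile *child,
                             size_t base)
{
    assert(base <= parent->len && base <= child->len);
    size_t pn = parent->len - base;     /* bytes the parent appended */
    size_t cn = child->len - base;      /* bytes the child appended */
    assert(parent->len + cn <= LOG_MAX && child->len + pn <= LOG_MAX);

    /* Each copy reads only bytes the other copy does not overwrite. */
    memcpy(parent->buf + parent->len, child->buf + base, cn);
    memcpy(child->buf + child->len, parent->buf + base, pn);
    parent->len += cn;
    child->len += pn;
}
```

After reconciliation both replicas hold the same set of writes, but the parent sees its own appended bytes first and the child sees its own first, which is the ordering difference noted in the text.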


4.4 Shared Memory Multithreading 


Shared memory multithreading is popular despite the 
nondeterminism it introduces into processes, in part be- 
cause parallel code need not pack and unpack messages: 
threads simply compute “in-place” on shared variables 
and structures. Since Determinator gives user spaces no 
physically shared memory other than read-only sharing 
via copy-on-write, emulating shared memory involves 
distributed shared memory (DSM) techniques. Adapting 
the private workspace model discussed in Section 2.2 to 
thread-level shared memory involves reusing ideas ex- 
plored in early parallel Fortran machines [7,51] and in 
release-consistent DSM systems [2, 17], although none 
of this prior work attempted to provide determinism. 
Our runtime uses the kernel’s Snap and Merge opera- 
tions (Section 3.2) to emulate shared memory in the pri- 
vate workspace model, using fork/join synchronization. 
To fork a child, the parent thread calls Put with the Copy, 
Snap, Regs, and Start options to copy the shared part of 
its memory into a child space, save a snapshot of that 
memory state in the child, and start the child running, as 
illustrated in Figure 6. The master thread may fork mul- 
tiple children this way. To synchronize with a child and 
collect its results, the parent calls Get with the Merge op- 
tion, which merges all changes the child made to shared 
memory, since its snapshot was taken, back into the par- 
ent. If both parent and child—or the child and other chil- 
dren whose changes the parent has collected—have con- 
currently modified the same byte since the snapshot, the 
kernel detects and reports this conflict. 

Our runtime also supports barriers, the foundation of 
data-parallel programming models like OpenMP [45]. 
When each thread in a group arrives at a barrier, it calls 
Ret to stop and wait for the parent thread managing 
the group. The parent calls Get with Merge to collect 
each child’s changes before the barrier, then calls Put 
with Copy and Snap to resume each child with a new 
shared memory snapshot containing all threads’ prior re- 
sults. While our private workspace model conceptually 
extends to non-hierarchical synchronization [5], our pro- 
totype’s strict space hierarchy currently limits synchro- 
nization flexibility, an issue we intend to address in the 
future. Any synchronization abstraction may be emulated 
at some cost as described in the next section, however. 

An application can choose which parts of its address 
space to share and which to keep thread-private. By plac- 
ing thread stacks outside the shared region, all threads 
can reuse the same stack area, and the kernel wastes no 
effort merging stack data. Thread-private stacks also of- 
fer the convenience of allowing a child thread to inherit 
its parent’s stack, and run “inline” in the same C/C++ 
function as its parent, as in Figure 1. If threads wish 
to pass pointers to stack-allocated structures, however, 
then they may locate their stacks in disjoint shared re- 
gions. Similarly, if the file system area is shared, then the 
threads share a common file descriptor namespace as in 
Unix. Excluding the file system area from shared space 
and using normal file system reconciliation (Section 4.2) 
to synchronize it yields thread-private file tables. 


4.5 Emulating Legacy Thread APIs 


As discussed in Section 2.3, we hope much existing se- 
quential code can readily be parallelized using naturally 
deterministic synchronization abstractions, like data- 
parallel models such as OpenMP [45] and SHIM [24] 
already offer. For code already parallelized using non- 
deterministic synchronization, however, Determinator’s 
runtime can emulate the standard pthreads API via deterministic 
scheduling [8, 10, 21], at certain costs. 

In a process that uses nondeterministic synchroniza- 
tion, the process’s initial master space never runs ap- 
plication code directly, but instead acts as a determin- 
istic scheduler. This scheduler creates one child space 
to run each application thread. The scheduler runs the 
threads under an artificial execution schedule, emulating 
a schedule by which a true shared-memory multiprocessor 
might in principle run them, but using a deterministic, 
virtual notion of time—namely, number of instructions 
executed—to schedule all inter-thread interactions. 

Like DMP [8, 21], our deterministic scheduler quantizes 
each thread’s execution by preempting it after exe- 
cuting a fixed number of instructions. Whereas DMP im- 
plements preemption by instrumenting user-level code, 
our scheduler uses the kernel’s instruction limit feature 
(Section 3.2). The scheduler “donates” execution quanta 
to threads round-robin, allowing each thread to run con- 
currently with other threads for one quantum, before col- 
lecting the thread’s shared memory changes via Merge 
and restarting it for another quantum. 
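
As a rough illustration of this round-robin quantization, the Python sketch below simulates it under simplifying assumptions: each "instruction" is a callable acting on a thread's private view, `run_quantized` and `inc` are hypothetical names, and overlapping writes are resolved by thread order, which is an assumption of the sketch rather than something the text specifies.

```python
def run_quantized(programs, shared, quantum):
    """Run each program `quantum` instructions at a time, merging after each round."""
    pcs = [0] * len(programs)             # per-thread instruction counters
    while any(pc < len(p) for pc, p in zip(pcs, programs)):
        snapshot = dict(shared)           # what every thread sees this quantum
        views = []
        for i, prog in enumerate(programs):
            view = dict(snapshot)         # private copy for this quantum
            for instr in prog[pcs[i]:pcs[i] + quantum]:
                instr(view)
            pcs[i] = min(pcs[i] + quantum, len(prog))
            views.append(view)
        for view in views:                # merge in fixed thread order (assumption)
            for addr, value in view.items():
                if value != snapshot.get(addr):
                    shared[addr] = value
    return shared


def inc(name):
    """Build an 'instruction' that increments one variable."""
    def instr(mem):
        mem[name] = mem.get(name, 0) + 1
    return instr


progs = [[inc("a")] * 3, [inc("b")] * 5]
result = run_quantized(progs, {}, quantum=2)
```

Because preemption depends only on the instruction count, rerunning the same programs with the same quantum always produces the same merged state.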

A thread’s shared memory writes propagate to other 
threads only at the end of each quantum, violating se- 
quential consistency [38]. Like DMP-B [8], our sched- 
uler implements a weak consistency model [28], totally 
ordering only synchronization operations. To enforce 
this total order, each synchronization operation could 
simply spin for a full quantum. To avoid wasteful 
spinning, however, our synchronization primitives inter- 
act with the deterministic scheduler directly. 

Each mutex, for example, is always “owned” by some 
thread, whether or not the mutex is locked. The mutex’s 
owner can lock and unlock the mutex without scheduler 
interactions, but any other thread needing the mutex must 
first invoke the scheduler to obtain ownership. At the 
current owner’s next quantum, the scheduler “steals” the 
mutex from its current owner if the mutex is unlocked, 
and otherwise places the locking thread on the mutex’s 
queue to be awoken once the mutex becomes available. 
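
The ownership rule can be sketched as a small state machine. This Python model is illustrative only: the class `OwnedMutex` and its method names are hypothetical, and the scheduler-side calls are assumed to run at the owner's quantum boundary.

```python
from collections import deque


class OwnedMutex:
    """A mutex that always has an owner, whether or not it is locked."""

    def __init__(self, owner):
        self.owner = owner
        self.locked = False
        self.waiters = deque()

    def lock(self, thread):
        # Fast path: only the current owner may lock without the scheduler.
        if thread != self.owner or self.locked:
            raise RuntimeError("must obtain ownership via the scheduler")
        self.locked = True

    def unlock(self, thread):
        if thread != self.owner or not self.locked:
            raise RuntimeError("only the owner can unlock a locked mutex")
        self.locked = False

    def request(self, thread):
        """Scheduler side: steal the mutex if unlocked, otherwise queue."""
        if not self.locked:
            self.owner = thread
            return True
        self.waiters.append(thread)
        return False

    def wake_next(self):
        """Scheduler side: hand an unlocked mutex to the first waiter."""
        if not self.locked and self.waiters:
            self.owner = self.waiters.popleft()
            return self.owner
        return None


m = OwnedMutex("T0")
m.lock("T0")
stolen = m.request("T1")      # T0 still holds the lock, so T1 is queued
m.unlock("T0")
woken = m.wake_next()         # ownership passes to T1
m.lock("T1")
```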

Since the scheduler can preempt threads at any 
point, a challenge common to any preemptive sce- 
nario is making synchronization functions such as 
pthread_mutex_lock() atomic. The kernel does 
not allow threads to disable or extend their own instruc- 
tion limits, since we wish to use instruction limits at pro- 
cess level as well, e.g., to enforce deterministic “time” 
quotas on untrusted processes, or to improve user-level 
process scheduling (see Section 4.1) by quantizing pro- 
cess execution. After synchronizing with a child thread, 
therefore, the master space checks whether the instruc- 
tion limit preempted a synchronization function, and if 
so, resumes the preempted code in the master space. Be- 
fore returning to the application, these functions check 
whether they have been “promoted” to the master space, 
and if so migrate their register state back to the child 
thread and restart the scheduler in the master space. 

While deterministic scheduling provides compatibility 
with existing parallel code, it has drawbacks. The master 
space, required to enforce a total order on synchroniza- 
tion operations, may be a scaling bottleneck unless exe- 
cution quanta are large. Since threads can interact only 
at quanta boundaries, however, large quanta increase the 
time one thread may waste waiting to interact with another, 
to steal an unlocked mutex for example. 

Further, since the deterministic scheduler may pre- 
empt a thread and propagate shared memory changes at 
any point in application code, the programming model 
remains nondeterministic. In contrast with our private 
workspace model, if one thread runs ‘x = y’ while an- 
other runs ‘y = x’ under the deterministic scheduler, the 
result may be repeatable but is no more predictable to the 
programmer than on traditional systems. While rerun- 
ning a program with exactly identical inputs will yield 
identical results, if the input is perturbed to change the 
length of any instruction sequence, these changes may 
cascade into a different execution schedule and trigger 
schedule-dependent if not timing-dependent bugs. 
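
The ‘x = y’ versus ‘y = x’ example can be worked through concretely. The Python sketch below is a simplified model with hypothetical function names: `private_workspace` lets each thread read its own snapshot and merges afterwards, while `quantized` makes writes visible at quantum boundaries, with `qa` and `qb` giving the quantum in which each assignment happens to fall.

```python
def private_workspace(x, y):
    """Each thread works on its own snapshot; changes merge at the join."""
    snap = {"x": x, "y": y}
    a = dict(snap)
    a["x"] = a["y"]                       # thread A: x = y
    b = dict(snap)
    b["y"] = b["x"]                       # thread B: y = x
    merged = dict(snap)
    for view in (a, b):
        for key, value in view.items():
            if value != snap[key]:
                merged[key] = value
    return merged["x"], merged["y"]


def quantized(x, y, qa, qb):
    """Writes become visible to the other thread only at quantum boundaries."""
    mem = {"x": x, "y": y}
    for q in range(max(qa, qb) + 1):
        snap = dict(mem)                  # memory as of the last boundary
        if qa == q:
            mem["x"] = snap["y"]          # thread A: x = y
        if qb == q:
            mem["y"] = snap["x"]          # thread B: y = x
    return mem["x"], mem["y"]


swap = private_workspace(1, 2)            # always a swap
same_quantum = quantized(1, 2, 0, 0)      # swap
a_first = quantized(1, 2, 0, 1)           # both end up with the old y
b_first = quantized(1, 2, 1, 0)           # both end up with the old x
```

Each `quantized` outcome is repeatable for fixed `qa` and `qb`, but which one occurs depends on how many instructions precede each assignment, which is exactly what a perturbed input can change.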


5 Prototype Implementation 


Determinator is written in C with small assembly frag- 
ments, currently runs on the 32-bit x86 architecture, and 
implements the kernel API and user-level runtime facil- 
ities described above. Source releases are available at 
‘http://dedis.cs.yale.edu/’. 

Since our focus is on parallel compute-bound applica- 
tions, Determinator’s I/O capabilities are currently lim- 
ited. The system provides text-based console I/O and a 
Unix-style shell supporting redirection and both scripted 
and interactive use. The shell offers no interactive job 
control, which would require currently unimplemented 
“nondeterministic privileges” (Section 4.1). The system 
has no demand paging or persistent disk storage: the 
user-level runtime’s logically shared file system abstrac- 
tion currently operates in physical memory only. 

The kernel supports application-transparent space mi- 
gration among up to 32 machines in a cluster, as de- 
scribed in Section 3.3. Migration uses a synchronous 
messaging protocol with only two request/response types 
and implements almost no optimizations such as page 
prefetching. The protocol runs directly atop Ethernet, 
and is not intended for Internet-wide distribution. 

The prototype has other limitations already mentioned. 
The kernel’s strict space hierarchy could bottleneck 
I/O-intensive applications (Section 3.1), and does 
not easily support non-hierarchical synchronization such 
as queues or futures (Section 4.4). The file system’s size 
is constrained to a process’s address space (Section 4.2), 
and special I/O files can fill up (Section 4.3). None of 
these limitations are fundamental to Determinator’s pro- 
gramming model. At some cost in complexity, the model 
could support non-hierarchical synchronization [5]. The 
runtime could store files in child spaces or on external 
I/O devices, and could garbage-collect I/O streams. 

Implementing instruction limits (Section 3.2) requires 
the kernel to recover control after a precise number of 
instructions execute in user mode. While the PA-RISC 
architecture provided this feature [1], the x86 does not, 
so we borrowed ReVirt’s technique [22]. We first set an 
imprecise hardware performance counter, which unpre- 
dictably overshoots its target a small amount, to interrupt 
the CPU before the desired number of instructions, then 
run the remaining instructions under debug tracing. 
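
The two-phase approach can be sketched as a simulation. In this Python model the function name `run_with_limit`, the bound `max_overshoot`, and the value 64 are all assumptions for illustration; the text only says the counter overshoots by a small, unpredictable amount.

```python
import random


def run_with_limit(limit, max_overshoot, rng):
    """Stop after exactly `limit` instructions using an imprecise counter.

    Returns (instructions executed, instructions run under single-stepping).
    """
    # Phase 1: arm the counter early enough that even the worst-case
    # overshoot cannot carry execution past the limit.
    target = max(limit - max_overshoot, 0)
    if target > 0:
        executed = target + rng.randint(0, max_overshoot)
    else:
        executed = 0                      # limit too small: single-step everything
    # Phase 2: run the remaining instructions one at a time under tracing.
    single_steps = 0
    while executed < limit:
        executed += 1
        single_steps += 1
    return executed, single_steps


rng = random.Random(0)
executed, steps = run_with_limit(limit=10_000_000, max_overshoot=64, rng=rng)
```

Whatever the overshoot, control returns after exactly `limit` instructions, and the slow single-stepped phase is bounded by the assumed overshoot margin.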


6 Evaluation 


This section evaluates the Determinator prototype, first 
informally, then examining single-node and distributed 
parallel processing performance, and finally code size. 


6.1 Experience Using the System 


We find that a deterministic programming model sim- 
plifies debugging of both applications and user-level 
runtime code, since user-space bugs are always repro- 
ducible. Conversely, when we do observe nondetermin- 
istic behavior, it can result only from a kernel (or hard- 
ware) bug, immediately limiting the search space. 

Because Determinator’s file system holds a process’s 
output until the next synchronization event (often the 
process’s termination), each process’s output appears 
as a unit even if the process executes in parallel with 
other output-generating processes. Further, different pro- 
cesses’ outputs appear in a consistent order across runs, 
as if run sequentially. (The kernel provides a system call 
for debugging that outputs a line to the “real” console im- 
mediately, reflecting true execution order, but chaotically 
interleaving output as in conventional systems.) 
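
A small model shows why buffered output appears in a stable order. This Python sketch assumes, for illustration, that the parent synchronizes with children in fork order; `collect_output` and its arguments are hypothetical names.

```python
def collect_output(children, finish_order):
    """children: (name, lines) pairs in fork order.
    finish_order: indices in the order the children physically finish."""
    finished = {}
    for idx in finish_order:              # physical completion order varies
        name, lines = children[idx]
        finished[idx] = [f"{name}: {line}" for line in lines]
    out = []
    for idx in range(len(children)):      # parent joins in fork order (assumption)
        out.extend(finished[idx])
    return out


jobs = [("cc", ["compiling", "done"]), ("ld", ["linking"])]
run1 = collect_output(jobs, finish_order=[0, 1])
run2 = collect_output(jobs, finish_order=[1, 0])
```

Both runs yield the same console text even though the children finish in different physical orders.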

While race detection tools exist [25,43], we found it 
convenient that Determinator always detects conflicts un- 
der “normal-case” execution, without requiring the user 
to run a special tool. Since the kernel detects shared 
memory conflicts and the user-level runtime detects file 
system conflicts at every synchronization event, Deter- 
minator’s model makes conflict detection as standard as 
detecting division by zero or illegal memory accesses. 

A subset of Determinator doubles as PIOS, “Parallel 
Instructional Operating System,” which we used in 
Yale’s operating system course this spring. While the 
OS course’s objectives did not include determinism, they 
included introducing students to parallel, multicore, and 
distributed operating system concepts. For this purpose, 
we found Determinator/PIOS to be a useful instructional 
tool due to its simple design, minimal kernel API, and 
adoption of distributed systems techniques within and 
across physical machines. PIOS is partly derived from 
MIT’s JOS [35], and includes a similar instructional 
framework where students fill in missing pieces of a 
“skeleton.” The twelve students who took the course, 
working in groups of two or three, all successfully reim- 
plemented Determinator’s core features: multiproces- 
sor scheduling with Get/Put/Ret coordination, virtual 
memory with copy-on-write and Snap/Merge, user-level 
threads with fork/join synchronization (but not deterministic 
scheduling), the user-space file system with versioning 
and reconciliation, and application-transparent 
cross-node distribution via space migration. In their final 
projects they extended the OS with features such as 
graphics, pipes, and a remote shell. While instructional 
use by no means indicates a system’s real-world utility, 
we find the success of the students in understanding and 
building on Determinator’s architecture promising. 

Figure 7: Determinator performance relative to pthreads 
under Ubuntu Linux on various parallel benchmarks. 


6.2 Single-node Multicore Performance 


Since Determinator runs user-level code “natively” on 
the hardware instead of rewriting user code [8,21], we 
expect it to perform comparably to conventional systems 
when executing single-threaded, compute-bound code. 
Since thread interactions require system calls, context 
switches, and virtual memory operations, however, we 
expect determinism to incur a performance cost in pro- 
portion to the frequency of thread interaction. 

Figure 7 shows the performance of several shared- 
memory parallel benchmarks we ported to Determina- 
tor, relative to the same benchmarks using conventional 
pthreads on 32-bit Ubuntu Linux 9.10. The md5 benchmark 
searches for an ASCII string yielding a particular 
MD5 hash, as in a brute-force password cracker; 
matmult multiplies two 1024 × 1024 integer matrices; 
qsort is a recursive parallel quicksort on an integer array; 
blackscholes is a financial benchmark from the PARSEC 
suite [11]; and fft, lu_cont, and lu_noncont are Fast 
Fourier Transform and LU-decomposition benchmarks 
from SPLASH-2 [57]. We tested all benchmarks on a 
2 socket x 6 core, 2.2GHz AMD Opteron PC. 

Coarse-grained benchmarks like md5, matmult, qsort, 
blackscholes, and fft show performance comparable with 
that of nondeterministic multithreaded execution under 
Linux. The md5 benchmark shows better scaling on De- 
terminator than on Linux, achieving a 2.25x speedup 
over Linux on 12 cores. We have not identified the pre- 
cise cause of this speedup over Linux but suspect scaling 
bottlenecks in Linux’s thread system [54]. 

Figure 8: Determinator parallel speedup over its own 
single-CPU performance on various benchmarks. 

Figure 9: Matrix multiply with varying matrix size. 


Porting the blackscholes benchmark to Determinator 
required no changes as it uses deterministically sched- 
uled pthreads (Section 4.5). The deterministic sched- 
uler’s quantization, however, incurs a fixed performance 
cost of about 35% for the chosen quantum of 10 million 
instructions. We could reduce this overhead by increas- 
ing the quantum, or eliminate it by porting the bench- 
mark to Determinator’s “native” parallel API. 


The fine-grained lu benchmarks show a higher performance 
cost, indicating that Determinator’s virtual 
memory-based approach to enforcing determinism is not 
well-suited to fine-grained parallel applications. Future 
hardware enhancements might make determinism practi- 
cal for fine-grained parallel applications, however [21]. 


Figure 8 shows each benchmark’s speedup relative to 
single-threaded execution on Determinator. The “embar- 
rassingly parallel” md5 and blackscholes scale well, mat- 
mult and fft level off after four processors (but still per- 
form comparably to Linux as Figure 7 shows), and the 
remaining benchmarks scale poorly. 


To quantify further the effect of parallel interaction 
granularity on deterministic execution performance, Fig- 
ures 9 and 10 show Linux-relative performance of mat- 
mult and qsort, respectively, for varying problem sizes. 
With both benchmarks, deterministic execution incurs a 
high performance cost on small problem sizes requiring 
frequent interaction, but on large problems Determinator 
is competitive with and sometimes faster than Linux. 


9th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’10) 





Figure 10: Parallel quicksort with varying array size. 
(Plot axes: array size in number of elements, 1K to 1M, versus 
speedup over Linux; series for 2, 4, 8, and 12 CPUs.) 

Figure 11: Speedup of deterministic shared memory 
benchmarks on varying-size distributed clusters. 


6.3 Distributed Computing Performance 


While Determinator’s rudimentary space migration (Sec- 
tion 3.3) is far from providing a full cluster comput- 
ing architecture, we would like to test whether such a 
mechanism can extend a deterministic computing model 
across nodes with usable performance at least for some 
applications. We therefore changed the md5 and mat- 
mult benchmarks to distribute workloads across a clus- 
ter of up to 32 uniprocessor nodes via space migration. 
Both benchmarks still run in a (logical) shared memory 
model via Snap/Merge. Since we did not have a clus- 
ter on which we could run Determinator natively, we ran 
it under QEMU [6], on a cluster of 2 socket x 2 core, 
2.4GHz Intel Xeon machines running SuSE Linux 11.1. 

Figure 11 shows parallel speedup under Determinator 
relative to local single-node execution in the same envi- 
ronment, on a log-log scale. In md5-circuit, the master 
space acts like a traveling salesman, migrating serially to 
each “worker” node to fork child processes, then retrac- 
ing the same circuit to collect their results. The md5-tree 
variation forks workers recursively in a binary tree: the 
master space forks children on two nodes, those children 
each fork two children on two nodes, etc. The matmult-tree 
benchmark implements matrix multiply with recursive 
work distribution as in md5-tree. 

Figure 12: Deterministic shared memory benchmarks 
versus distributed-memory equivalents for Linux. 


                            Determinator    PIOS 
Component                   Semicolons      Semicolons 
Kernel core                  2,044          1,847 
Hardware/device drivers        751            647 
User-level runtime           2,952          1,079 
Generic C library code       6,948            394 
User-level programs          1,797          1,418 
Total                       14,492          5,385 


Table 3: Implementation code size of the Determinator 
OS and of PIOS, its instructional subset. 

The “embarrassingly parallel” md5-tree performs and 
scales well, but only with recursive work distribution. 
Matrix multiply levels off at two nodes, due to the 
amount of matrix data the kernel transfers across nodes 
via its simplistic page copying protocol, which currently 
performs no data streaming, prefetching, or delta com- 
pression. The slowdown for 1-node distributed execution 
in matmult-tree reflects the cost of transferring the matrix 
to a (single) remote machine for processing. 

As Figure 12 shows, the shared memory md5-tree 
and matmult-tree benchmarks, running on Determina- 
tor, perform comparably to nondeterministic, distributed- 
memory equivalents running on Puppy Linux 4.3.1, in 
the same QEMU environment. Determinator’s clustering 
protocol does not use TCP as the Linux-based bench- 
marks do, so we explored the benchmarks’ sensitivity 
to this factor by implementing TCP-like round-trip tim- 
ing and retransmission behavior in Determinator. These 
changes resulted in less than a 2% performance impact. 

Illustrating the simplicity benefits of Determinator’s 
shared memory thread API, the Determinator version of 
md5 is 63% the size of the Linux version (62 lines con- 
taining semicolons versus 99), which uses remote shells 
to coordinate workers. The Determinator version of mat- 
mult is 34% the size of its Linux equivalent (90 lines ver- 
sus 263), which passes data explicitly via TCP. 


6.4 Implementation Complexity 


To provide a feel for implementation complexity, Table 3 
shows source code line counts for Determinator, as well 
as its PIOS instructional subset, counting only lines containing 
semicolons. The entire system is less than 15,000 
lines, about half of which is generic C and math library 
code needed mainly for porting Unix applications easily. 


7 Related Work 


Recognizing the benefits of determinism [12,40], paral- 
lel languages such as SHIM [24] and DPJ [12, 13] en- 
force determinism at language level, but require rewrit- 
ing, rather than just parallelizing, existing serial code. 
Race detectors [25, 43] detect low-level heisenbugs in 
nondeterministic parallel programs, but may miss higher- 
level heisenbugs [3]. Language extensions can dynami- 
cally check determinism assertions [16, 49], but heisen- 
bugs may persist if the programmer omits an assertion. 

Early parallel Fortran systems [7,51], release con- 
sistent DSM [2, 17], transactional memory [33,52] and 
OS APIs [48], replicated file systems [53,55], and dis- 
tributed version control [29] all foreshadow Determina- 
tor’s private workspace programming model. None of 
these precedents create a deterministic application pro- 
gramming model, however, as is Determinator’s goal. 

Deterministic schedulers such as DMP [8, 21] and 
Grace [10] instrument an application to schedule inter- 
thread interactions on a repeatable, artificial time sched- 
ule. DMP isolates threads via code rewriting, while 
Grace uses virtual memory as in Determinator. De- 
veloped simultaneously with Determinator, dOS [9] in- 
corporates a deterministic scheduler into the Linux ker- 
nel, preserving Linux’s existing programming model and 
API. This approach provides greater backward compati- 
bility than Determinator’s clean-slate design, but makes 
the Linux programming model no more semantically de- 
terministic than before. Determinator offers new thread 
and process models redesigned to eliminate conventional 
data races, while supporting deterministic scheduling in 
user space for backward compatibility. 

Many techniques are available to log and replay non- 
deterministic events in parallel applications [39, 46, 56]. 
SMP-ReVirt can log and replay a multiprocessor virtual 
machine [23], supporting uses such as system-wide in- 
trusion analysis [22,34] and replay debugging [37]. Log- 
ging a parallel system’s nondeterministic events is costly 
in performance and storage space, however, and usu- 
ally infeasible for “normal-case” execution. Determi- 
nator demonstrates the feasibility of providing system- 
enforced determinism for normal-case execution, with- 
out internal event logging, while maintaining perfor- 
mance comparable with current systems at least for 
coarse-grained parallel applications. 

Determinator’s kernel design owes much to microker- 
nels such as L3 [41]. An interesting contrast is with 
the Exokernel approach [26], which is incompatible with 
Determinator’s. System-enforced determinism requires 
hiding nondeterministic kernel state from applications, 


such as the physical addresses of virtual memory pages, 
whereas exokernels deliberately expose this state. 


8 Conclusion 


While Determinator is only a proof-of-concept, it shows 
that operating systems can offer a pervasively and nat- 
urally deterministic application environment, avoiding 
the introduction of data races in shared memory and file 
system access, thread and process synchronization, and 
throughout the API. Our experiments suggest that such 
an environment can efficiently run coarse-grained paral- 
lel applications, both on a single multicore machine and 
across a cluster, though supporting fine-grained parallelism 
efficiently may require hardware evolution. 


Abstract 


A deterministic multithreading (DMT) system eliminates 
nondeterminism in thread scheduling, simplifying the 
development of multithreaded programs. However, ex- 
isting DMT systems are unstable; they may force a pro- 
gram to (ad)venture into vastly different schedules even 
for slightly different inputs or execution environments, 
defeating many benefits of determinism. Moreover, few 
existing DMT systems work with server programs whose 
inputs arrive continuously and nondeterministically. 

TERN is a stable DMT system. The key novelty in 
TERN is the idea of schedule memoization that memo- 
izes past working schedules and reuses them on future 
inputs, making program behaviors stable across different 
inputs. A second novelty in TERN is the idea of win- 
dowing that extends schedule memoization to server pro- 
grams by splitting continuous request streams into win- 
dows of requests. Our TERN implementation runs on 
Linux. It operates as user-space schedulers, requiring no 
changes to the OS and only a few lines of changes to the 
application programs. We evaluated TERN on a diverse 
set of 14 programs (e.g., Apache and MySQL) with real 
and synthetic workloads. Our results show that TERN 
is easy to use, makes programs more deterministic and 
stable, and has reasonable overhead. 


1 Introduction 


Multithreaded programs are difficult to write, test, and 
debug. A key reason is nondeterminism: different runs of 
a multithreaded program may show different behaviors, 
depending on how the threads interleave [35]. 

Two main factors make threads interleave nondeter- 
ministically. The first is scheduling, how the OS and 
hardware schedule threads. Scheduling nondeterminism 
is not essential and can be eliminated without affecting 
correctness for most programs. The second is input, what 
data (input data) arrives at what time (input timing). In- 
put nondeterminism sometimes is essential because ma- 
jor changes in inputs require different schedules. How- 


ever, frequently input nondeterminism is not essential 
and the same schedule can be used to process many dif- 
ferent inputs (§2.2). We believe nonessential nondeter- 
minism should be eliminated in favor of determinism. 

Deterministic multithreading (DMT) systems [13, 22, 
41] make threads more deterministic by eliminating 
scheduling nondeterminism. Specifically, they constrain 
a multithreaded program such that it always uses the 
same thread schedule for the same input. By doing so, 
these systems make program behaviors repeatable, in- 
crease testing confidence, and ease bug reproduction. 

Unfortunately, though existing DMT systems elimi- 
nate scheduling nondeterminism, they do not reduce in- 
put nondeterminism. In fact, they may aggravate the ef- 
fects of input nondeterminism because of their design 
limitation: when scheduling the threads to process an 
input, they consider only this input and ignore previ- 
ous similar inputs. This stateless design makes schedules 
over-dependent on inputs, so that a slight change to in- 
puts may force a program to (ad)venture into a vastly dif- 
ferent, potentially buggy schedule, defeating many bene- 
fits of determinism. We call this the instability problem. 
This problem is confirmed by our results (§8.2.1) from 
an existing DMT system [13]. 

In fact, even with the same input, existing DMT sys- 
tems may still force a program into different schedules 
for minor changes in the execution environment such as 
processor type and shared library. Thus, developers may 
no longer be able to reproduce bugs by running their pro- 
gram on the bug-inducing input, because their machine 
may differ from the machine where the bug occurred. 

This paper presents TERN, a schedule-centric, stateful 
DMT system. It addresses the instability problem us- 
ing an idea called schedule memoization that memoizes 
past working schedules and reuses them for future inputs. 
Specifically, TERN maintains a cache of past schedules 
and the input constraints required to reuse these sched- 
ules. When an input arrives, TERN checks the input 
against the memoized constraints for a compatible schedule. 
If it finds one, it simply runs the program while 
enforcing this schedule. Otherwise, it runs the program 
to memoize a schedule and the input constraints of this 
schedule for future reuse. By reusing schedules, TERN 
avoids potential errors in unknown schedules. This advantage 
is illustrated in Figure 1. 

Figure 1: Advantage of schedule memoization. Each solid 
shape represents an input, and each curved line a schedule. 
Schedule memoization reuses schedules when possible, avoiding 
bugs in unknown schedules and making program behaviors 
repeatable across similar inputs. 
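
The lookup-or-record loop described above might be sketched as follows. This Python model is not TERN's implementation: `ScheduleCache`, `run`, and `record` are hypothetical names, constraints are plain predicates rather than symbolic-execution output, and the example constraint (same thread and block counts) is chosen only for illustration.

```python
class ScheduleCache:
    """Past schedules, each paired with the input constraint for reusing it."""

    def __init__(self):
        self.entries = []                 # list of (constraint, schedule)

    def lookup(self, inp):
        for constraint, schedule in self.entries:
            if constraint(inp):
                return schedule
        return None

    def memoize(self, constraint, schedule):
        self.entries.append((constraint, schedule))


def run(cache, inp, record):
    """Reuse a compatible schedule if one exists, else record a new one."""
    schedule = cache.lookup(inp)
    if schedule is not None:
        return schedule, "reused"
    constraint, schedule = record(inp)
    cache.memoize(constraint, schedule)
    return schedule, "memoized"


def record(inp):
    """Toy recorder: the schedule depends only on thread and block counts."""
    nthreads, nblocks = inp["threads"], inp["blocks"]
    schedule = [f"t{b % nthreads}:block{b}" for b in range(nblocks)]

    def constraint(other):
        return other["threads"] == nthreads and other["blocks"] == nblocks

    return constraint, schedule


cache = ScheduleCache()
first = run(cache, {"file": "a.txt", "threads": 2, "blocks": 4}, record)
second = run(cache, {"file": "b.txt", "threads": 2, "blocks": 4}, record)
third = run(cache, {"file": "a.txt", "threads": 3, "blocks": 4}, record)
```

The second input differs in content but satisfies the memoized constraint, so it follows the already-exercised schedule; only the third input forces a new recording.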

A real-world analogy to schedule memoization is the 
natural tendencies in humans and animals to follow fa- 
miliar routes to avoid possible hazards along unknown 
routes. Migrant birds, for example, often migrate along 
fixed “flyways.” We thus name our system after the Arc- 
tic Tern, a bird species that migrates the farthest among 
all migrants [2]. 

A second advantage of schedule memoization is that 
it makes schedules explicit, providing flexibility in de- 
ciding when to memoize certain schedules. For instance, 
TERN allows developers to populate a schedule cache of- 
fline, to avoid the overhead of doing so online. Moreover, 
TERN can check for errors (e.g., races) in schedules and 
memoize only the correct ones, thus avoiding the buggy 
schedules and amortizing the cost of checking for errors. 

To make TERN practical, it must handle server pro- 
grams which frequently use threads for performance. 
These programs present two challenges for TERN: (1) 
they often process client inputs (requests) as they arrive, 
thus suffering from input timing nondeterminism, which 
existing DMT systems do not handle and (2) they may 
run continuously, making their schedules effectively in- 
finite and too specific to reuse. 

TERN addresses these challenges using a simple idea 
called windowing. Our insight is that server programs 
tend to return to the same quiescent states. Thus, TERN 
splits the continuous request stream of a server into win- 
dows and lets the server quiesce in between, so that 
TERN can memoize and reuse schedules across windows. 
Within a window, it admits requests only at fixed sched- 
ule points, reducing timing nondeterminism. 
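
A minimal sketch of windowing, assuming fixed-size windows: the text says only that the stream is split into windows with quiescence in between, so the size-based split, the names `windows` and `serve`, and the use of request kinds as a stand-in for input constraints are assumptions of this Python model.

```python
def windows(requests, size):
    """Split a request stream into windows of at most `size` requests."""
    batch = []
    for req in requests:
        batch.append(req)
        if len(batch) == size:
            yield batch
            batch = []                    # the server quiesces before the next window
    if batch:
        yield batch


def serve(requests, size):
    """Memoize one schedule per window shape and reuse it when the shape recurs."""
    cache, log = {}, []
    for win in windows(requests, size):
        key = tuple(kind for kind, _ in win)   # stand-in for real input constraints
        if key in cache:
            log.append("reused")
        else:
            cache[key] = [f"handle {kind}" for kind in key]
            log.append("memoized")
    return log


stream = [("GET", "/a"), ("GET", "/b"), ("GET", "/c"), ("GET", "/d"), ("POST", "/e")]
log = serve(stream, size=2)
```

Because each window is finite and starts from a quiescent state, a schedule recorded for one window can apply to later windows of the same shape.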

We implemented TERN in Linux. It runs as “parasitic” 
user-space schedulers within the application’s address 
space, overseeing the decisions of the OS scheduler 
and synchronization library. It memoizes and reuses 
synchronization orders as schedules to increase performance 
and reuse rates. It tracks input constraints using 
KLEE [17], a symbolic execution engine. Our implementation 
is software-only, works with general C/C++ programs 
using threads, and requires no kernel modifications 
and only a few lines of modification to applications, 
thus simplifying deployment. 

We evaluated TERN on a diverse set of 14 pro- 
grams, including two server programs Apache [10] and 
MySQL [4], a parallel compression utility PBZip2 [5], 
and 11 scientific programs in SPLASH2 [6]. Our work- 
load included a Columbia CS web trace and benchmarks 
used by Apache and MySQL developers. Our results 
show that 

1. TERN is easy to use. For most programs, we modi- 
fied only a few lines to adapt them to TERN. 

2. TERN enforces stability across different inputs. In 
particular, it reused 100 schedules to process 90.3% 
of a 4-day Columbia CS web trace. Moreover, while 
an existing DMT system [13] made three bugs in- 
consistently occur or disappear depending on minor 
input changes, TERN always avoided these bugs. 

3. TERN has reasonable overhead. For nine out of four- 
teen evaluated programs, TERN has negligible over- 
head or improves performance; for the other pro- 
grams, TERN has up to 39.1% overhead. 

4. TERN makes threads deterministic. For twelve out 
of fourteen evaluated programs, the schedules TERN 
memoized can be deterministically reused barring the 
assumption discussed in §7. 

Our main conceptual contributions are that we identi- 
fied the instability problem in existing DMT systems and 
proposed two ideas, schedule memoization and window- 
ing, to mitigate input nondeterminism. Our engineering 
contributions include the TERN system and its evaluation 
of real programs. To the best of our knowledge, TERN 
is the first stable DMT system, the first to mitigate in- 
put timing nondeterminism, and the first shown to work 
on programs as large, complex, and nondeterministic as 
Apache and MySQL. TERN demonstrates that DMT has 
the potential to be deployed today. 

This paper is organized as follows. We first present 
a background (§2) and an overview of TERN (§3). We 
then describe TERN’s interface (§4), schedule memoization 
for batch programs (§5), and windowing to extend 
TERN to server programs (§6). We then present refinements 
we made to optimize TERN (§7). Lastly, we show 
our experimental results (§8), discuss related work (§9), 
and conclude (§10). 


2 Background 


This section presents a background of TERN. We explain 
the instability problem of existing DMT systems (§2.1), 
our choice of schedule representation in TERN (§2.2), 
and why we can reuse schedules across inputs (§2.3). 


2.1 The Instability Problem 


A DMT system is, conceptually, a function that maps an 
input I to a schedule S. The properties of this function 
are that the same I should map to the same S and that 
S is a feasible schedule for processing I. A stable DMT 
system such as TERN has an additional property: it maps 
similar inputs to the same schedule. Existing DMT sys- 
tems, however, tend to map similar inputs to different 
schedules, thus suffering from the instability problem. 

We argue that this problem is inherent in existing 
DMT systems because they are stateless. They must 
provide the same schedule for an input across differ- 
ent runs, using information only from the current run. 
To force threads to communicate (e.g., acquire locks or 
access shared memory) deterministically, existing DMT 
systems cannot rely on physical clocks. Instead, they 
maintain a logical clock per thread that ticks determin- 
istically based on the code this thread has run. More- 
over, threads may communicate only when their logical 
clocks have deterministic values (e.g., smallest across the 
logical clocks of all threads [41]). By induction, logical 
clocks make threads deterministic. 

However, the problem with logical clocks is that for 
efficiency, they must tick at roughly the same rate to 
prevent a thread with a slower clock from starving oth- 
ers. Thus, existing DMT systems have to tie their logical 
clocks to low-level instructions executed (e.g., completed 
loads [41]). Consequently, a small change to the input or 
execution environment may alter a few instructions exe- 
cuted, in turn altering the logical clocks and subsequent 
thread communications. That is, a small change to the 
input or execution environment may cascade into a much 
different (e.g., correct vs. buggy) schedule. 
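
The cascade can be illustrated with a toy model. The sketch below is not taken from any existing DMT system; it assumes a hypothetical function dmt_order() in which each thread's logical clock counts the instructions the thread has executed, and the thread with the smallest clock (ties broken by lowest thread ID) performs the next synchronization operation.

```c
#include <assert.h>

#define MAX_THREADS 16

/* Hypothetical toy model of a logical-clock DMT scheduler.
 * work[t * nops + k] is the number of instructions thread t executes
 * before its k-th synchronization operation. The resulting total order
 * of thread IDs is written to order[]; returns the number of entries. */
int dmt_order(int nthreads, int nops, const long *work, int *order)
{
    long clock[MAX_THREADS];   /* per-thread logical clocks */
    int next_op[MAX_THREADS];  /* index of each thread's pending operation */
    int emitted = 0;

    assert(nthreads > 0 && nthreads <= MAX_THREADS && nops > 0);
    for (int t = 0; t < nthreads; t++) {
        clock[t] = work[t * nops];  /* clock value on reaching the first op */
        next_op[t] = 0;
    }
    while (emitted < nthreads * nops) {
        int pick = -1;
        for (int t = 0; t < nthreads; t++) {
            if (next_op[t] == nops)
                continue;                      /* thread has finished */
            if (pick < 0 || clock[t] < clock[pick])
                pick = t;                      /* smallest clock, lowest ID on ties */
        }
        order[emitted++] = pick;
        next_op[pick]++;
        if (next_op[pick] < nops)              /* run on to the next operation */
            clock[pick] += work[pick * nops + next_op[pick]];
    }
    return emitted;
}
```

In this model, instruction counts {10, 10} for thread 0 and {15, 15} for thread 1 yield the order 0, 1, 0, 1. Changing only the first count of thread 0 from 10 to 16 yields 1, 0, 0, 1: a difference of six instructions reorders the synchronizations.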


2.2 Schedule Representation and Determinism 


Previous DMT systems have considered two types of 
schedules: (1) a deterministic order of shared memory 
accesses [13, 22] and (2) a synchronization order (i.e., a 
total order of synchronization operations) [41]. The first 
type of schedules are truly deterministic even if there are 
races, but they are costly to enforce on commodity hard- 
ware (e.g., up to 10 times overhead [13]). The second 
type can be efficiently enforced (e.g., 16% overhead [41]) 
because most code is not synchronization code and can 
run in parallel; however, they are deterministic only for 
inputs that lead to race-free runs [41, 46]. 

TERN represents schedules as synchronization orders
for efficiency. An additional benefit is that synchronization
orders can be reused more frequently than memory
access orders (cf. next subsection). Moreover, researchers
have found that many concurrency errors are not data
races, but atomicity and order violations [39]. These errors
can be deterministically reproduced or avoided using
only synchronization orders.

Program   Input Constraints for Schedule Reuse
PBZip2    Same number of file blocks (NumBlocks or -b)
          and threads (-p)
Apache    For groups of typical HTTP GET requests,
          same cache status and response sizes
fft       Same number of threads (-p)
lu        Same number of threads (-p), size of the
          matrix (-n), and block size (-b)
barnes    Same number of threads (NPROC) and values
          of variables dtime and tstop

Table 1: Input constraints of five programs to reuse schedules.
Identifiers without a dash are configuration variables, and those
with a dash are command line options.

Although data races may still make runs which reuse 
schedules nondeterministic, TERN is less prone to this 
problem than existing DMT systems [41] because it has 
the flexibility to select schedules. If it detects a race in 
a memoized schedule, it can simply discard this sched- 
ule and memoize another. This selection task is often 
easy because most schedules are race-free. In rare cases, 
TERN may be unable to find a race-free schedule, result- 
ing in nondeterministic runs. However, we argue that in- 
put nondeterminism cannot be fully eliminated anyway, 
so we may as well tolerate some scheduling nondeter- 
minism, following the end-to-end argument. 


2.3 Why Can We Reuse Schedules?


This subsection presents an intuitive and an empirical 
argument to support our insight that we can frequently 
reuse schedules for many programs/workloads. Intu- 
itively, synchronization operations map to developer in- 
tents of inter-thread control flow. By enforcing the 
same synchronization order, we fix the same inter-thread 
“path,” but still allow many different inputs to flow down 
this path. (This observation is similarly made for sequen- 
tial paths [11, 12, 26].) 

To empirically validate our insight, we studied the 
input constraints to reuse schedules for five programs, 
including a parallel compression utility PBZip2; the 
Apache web server; and three scientific programs fft, lu, 
and barnes in SPLASH2. Table 1 shows the results for 
all programs studied. We found that the input constraints 
were often general, allowing frequent reuses of sched- 
ules. For instance, PBZip2 can use the same schedule to 
compress many different files, as long as the number of 
threads and the number of file blocks remain the same. 


3 Overview 


Our design of TERN adheres to the following goals:

Figure 2: TERN architecture. Its components are shaded.


1. Backward compatibility. We design TERN for gen- 
eral multithreaded programs because of their domi- 
nance in parallel programs today and likely tomor- 
row. We design TERN to run in user-space and on 
commodity hardware to ease deployment. 

2. Stability. We design TERN to bias multithreaded 
programs toward repeating their past, familiar sched- 
ules, instead of venturing into unfamiliar ones. 

3. Efficiency. We design TERN to be efficient because 
it operates during the normal executions of programs, 
not replayed executions. 

4. Best-effort determinism. We design TERN to make 
threads deterministic, but we sacrifice determinism 
when it contradicts the preceding goals. 

The remainder of this section presents TERN’s architecture
(§3.1), workflow (§3.2), deployment scenarios
(§3.3), and limitations (§3.4).


3.1 Architecture 


Figure 2 shows the architecture of TERN and its five
components: instrumentor, schedule cache, proxy, replayer,
and memoizer. To use TERN, developers first
annotate their application by marking the input data
that may affect synchronization operations. They then
compile their program with the instrumentor, which
intercepts standard synchronization operations such as
pthread_mutex_lock() so that at runtime TERN
can control these operations. (We describe additional
annotations and instrumentations that TERN needs in §4.)
The instrumentor runs as a plugin to LLVM [3], requiring
no modifications to the compiler.

The schedule cache stores all memoized schedules and
their input constraints. This cache can be marshalled to
disk and read back upon program start, so that it need
not be repopulated. Each memoized schedule is conceptually
a tuple (C, S), where S is a synchronization order
and C is the set of input constraints required to reuse S.
(We explain the actual representation in §5.2.)

At runtime, once an input I arrives, the proxy intercepts
the input and queries the schedule cache for a
constraint-schedule tuple (C_i, S_i) such that I satisfies
C_i. On a cache hit, the proxy lets the replayer run the
program on input I and enforce schedule S_i. On a cache
miss, it lets the memoizer run the program on input I to
memoize a new schedule.

 1: main(int argc, char *argv[]) {
 2:   int i, nthread = argv[1], nblock = argv[2];
 3:   symbolic(&nthread, sizeof(int)); // mark input data
 4:   symbolic(&nblock, sizeof(int)); // that affects schedules
 5:   for(i=0; i<nthread; ++i)
 6:     pthread_create(worker); // create worker threads
 7:   for(i=0; i<nblock; ++i) {
 8:     block = read_block(i); // read i’th file block
 9:     worklist.add(block); // add block to work list
10:   }
11: }
12: worker() { // worker threads for compressing file blocks
13:   for(;;) {
14:     block = worklist.get(); // get a file block from work list
15:     compress(block);
16:   }
17: }

Figure 3: Simplified PBZip2 code.

During a memoization run, the memoizer records all
synchronization operations into a schedule S. It also
computes C, the input constraints for reusing S, via symbolic
execution [17]. The basic idea of symbolic execution
is to track the outcomes of branches that observe
symbolic data, in our case, the data marked by developers
as affecting synchronizations. Once the memoization
run ends, the set of branch outcomes we collected describes
the input constraints needed to reuse the memoized
schedule.

For determinism, the memoizer can optionally check
a memoization run for data races. If it detects no races, it
simply stores (C, S) into the schedule cache. Otherwise,
it can discard the memoized schedule and rerun the program
with a different scheduling algorithm to memoize
another schedule.

The proxy performs an additional task for server pro- 
grams to reduce input timing nondeterminism and to 
reuse schedules for these programs. Specifically, it 
buffers the requests of a server into a window with a fixed 
size. When the window becomes full, or remains partial 
for a predefined timeout, TERN runs the server to process 
the window as if the server were a batch program. It then 
lets the server quiesce before moving to the next window 
to avoid interference between windows. 
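
A minimal sketch of the buffering rule just described may help. It is not TERN's proxy code: the names window_add(), window_ready(), and window_drain() and the constant WINDOW_SIZE are hypothetical, requests are reduced to integer IDs, the current time is passed in explicitly, and the timeout is assumed to be measured from the arrival of the first buffered request.

```c
#include <assert.h>
#include <stdbool.h>

#define WINDOW_SIZE 4            /* hypothetical fixed window size */

struct window {
    int req[WINDOW_SIZE];        /* buffered request IDs */
    int count;                   /* number of buffered requests */
    long opened_at_ms;           /* arrival time of the first buffered request */
};

/* Buffer a request; returns false if the window is already full. */
bool window_add(struct window *w, int request, long now_ms)
{
    if (w->count == WINDOW_SIZE)
        return false;
    if (w->count == 0)
        w->opened_at_ms = now_ms;
    w->req[w->count++] = request;
    return true;
}

/* A window is processed when it is full, or when it has stayed
 * partial for at least timeout_ms. An empty window is never ready. */
bool window_ready(const struct window *w, long now_ms, long timeout_ms)
{
    if (w->count == WINDOW_SIZE)
        return true;
    return w->count > 0 && now_ms - w->opened_at_ms >= timeout_ms;
}

/* Hand the buffered requests to the server as one batch and reset. */
int window_drain(struct window *w, int *out)
{
    int n = w->count;
    for (int i = 0; i < n; i++)
        out[i] = w->req[i];
    w->count = 0;
    return n;
}
```

A caller would poll window_ready(), run the server on the drained batch, wait for the server to quiesce, and only then start filling the next window.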


3.2 Workflow and An Example 


We illustrate how TERN works using PBZip2 as an example.
Figure 3 shows the simplified code of PBZip2.
Variables nthread and nblock affect synchronizations,
so developers mark them by calling the TERN-provided
method symbolic() (line 3 and line 4). This
code spawns nthread worker threads, splits the file
into nblock blocks, and compresses them in parallel
by calling compress(). To coordinate the worker
threads, it uses a synchronized work list. (Note TERN
tracks low-level synchronizations such as pthread primitives;
we use a work list here only for clarity.)

// main              worker 1              worker 2
9: worklist.add();
                     14: worklist.get();
9: worklist.add();
                                           14: worklist.get();

Figure 4: Synchronization order of a PBZip2 run.

5: 0 < nthread ? true
5: 1 < nthread ? true
5: 2 < nthread ? false
7: 0 < nblock ? true
7: 1 < nblock ? true
7: 2 < nblock ? false

Figure 5: Input constraints of a PBZip2 run.

Suppose we run PBZip2 with two threads on a two- 
block file. Suppose the schedule cache is empty and 
TERN runs the memoizer to memoize a new schedule. 
As PBZip2 runs, TERN controls and records the synchro- 
nization operations (line 9 and line 14). It also tracks 
the outcomes of branch statements that observe symbolic 
data (line 5 and line 7). At the end of the run, TERN
records a schedule as shown in Figure 4. It also collects
constraints as shown in Figure 5, which simplify
to nthread = 2 ∧ nblock = 2.¹ It stores the schedule
and the input constraints into the schedule cache.

If we run PBZip2 again with two threads on a different 
two-block file, TERN will check if variables nthread
and nblock satisfy any set of constraints in the schedule 
cache. In this case, TERN will succeed. It will then reuse 
the schedule (Figure 4) to compress the file, even though 
the file data may differ completely. 
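
The walkthrough above can be sketched in a few lines of C. This is an illustration only, not TERN's implementation (which tracks constraints with a symbolic execution engine): the types and the functions record_loop() and satisfies() are hypothetical, and only branches of the form "i < n" over a marked variable n are modeled.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_OUTCOMES 64

/* One recorded branch outcome: "(bound < input[var]) == taken". */
struct outcome {
    int var;      /* index of the symbolic variable, e.g. 0 = nthread, 1 = nblock */
    int bound;    /* loop counter value at the time of the test */
    bool taken;   /* outcome observed in the memoization run */
};

struct constraint_set {
    struct outcome o[MAX_OUTCOMES];
    int n;
};

/* Memoization run: execute "for (i = 0; i < value; ++i)" and log every test. */
void record_loop(struct constraint_set *cs, int var, int value)
{
    for (int i = 0; ; ++i) {
        bool taken = i < value;
        assert(cs->n < MAX_OUTCOMES);
        cs->o[cs->n++] = (struct outcome){ var, i, taken };
        if (!taken)
            break;
    }
}

/* Reuse check: a new input matches if it reproduces every recorded outcome. */
bool satisfies(const struct constraint_set *cs, const int *input)
{
    for (int k = 0; k < cs->n; k++) {
        const struct outcome *o = &cs->o[k];
        if ((o->bound < input[o->var]) != o->taken)
            return false;
    }
    return true;
}
```

Recording both loops with nthread = 2 and nblock = 2 produces the six outcomes of Figure 5. Any later input with the same two values satisfies them regardless of file contents, while an input with three threads or one block does not.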


3.3 Deployment Scenarios


We anticipate three ways users may deploy TERN to 
make their programs stable and deterministic. 
Schedule-carrying code. Developers pre-populate a 
cache of correct, representative schedules on typical 
workloads, then ship their program with the cache hard- 
wired and marked read-only. 

Online memoization. Users can turn on memoization 
at their local sites so that TERN can memoize schedules 
as the programs run on real inputs. 

Shadow memoization. Since tracking input constraints
is slow, users can configure TERN to memoize schedules
asynchronously. Specifically, for an input that misses the
schedule cache, the proxy runs the program as is, while
forwarding a copy of the input to the memoizer.

¹ Although in this example the constraints are collected from one
thread, TERN can actually collect constraints from multiple threads.

Each deployment mode has pros and cons. The first 
mode makes a program stable and deterministic across 
different sites, but may react poorly to site-specific work- 
loads. The second mode updates the schedule cache 
based on site-specific workloads, but may be slow be- 
cause memoization runs tend to be slow. The last ap- 
proach avoids the slowdown, but allows a program to run 
nondeterministically when an input misses the schedule 
cache. For server programs with high performance re- 
quirements, we recommend the first and the third modes. 


3.4 Limitations 


Determinism. TERN aims for best-effort determinism 
for reasons discussed in §2.2. If TERN is unable to find 
a race-free schedule for an input, the run may be nonde- 
terministic. We foresee several strategies to handle this 
corner case while adhering to the other goals of TERN. 
For instance, we can instrument the program to fix the 
detected races or apply one of the existing DMT algo- 
rithms to resolve the races deterministically. The advan- 
tage of combining these techniques with TERN is that 
we apply these expensive techniques only to a small por- 
tion of schedules, and use TERN to efficiently handle the 
common case. We leave these ideas for future work. 
Applicability. We anticipate our approach will work 
well for many programs/workloads as long as (1) they 
can benefit from determinism and stability, (2) their con- 
straints can be tracked by TERN, (3) their schedules can 
be frequently reused, and (4) if windowing is needed, 
their inputs can be buffered. For programs/workloads 
that violate these assumptions, TERN may work poorly. 
These programs/workloads may include parallel simula- 
tors that require nondeterminism for statistical results, 
GUI programs that cannot buffer user actions for la- 
tency reasons, randomly generated workloads that pre- 
vent schedule reuses, and programs whose schedules de- 
pend on floating point inputs (which cannot be tracked 
by TERN’s underlying symbolic execution engine). 
Manual annotation. TERN requires manual annotations.
However, this annotation overhead tends to be
small. (See §7.4 for how TERN reduces this overhead
and §8.1 for an evaluation of this overhead.) This overhead
may be further reduced using simple static analysis.


4 Interface 


Table 2 shows TERN’s annotation interface which developers
and the instrumentor use to annotate multithreaded
programs. The annotations fall into four categories:
(1) symbolic() for marking data that may
affect schedules; (2) task boundary annotations for marking
the beginning and end of logical tasks, in case threads
get reused for different logical tasks (§6); (3) wrappers
to synchronization operations (more examples in the
next paragraph); and (4) hook functions inserted around
blocking system calls, which TERN memoizes because
blocking system calls are natural scheduling points.

Annotations           Inserted by    Semantics
symbolic(data, len)   Developer      Marks data that may affect schedules. The memoizer
                                     tracks constraints on this data. The replayer checks
                                     this data against the memoized constraints.
begin_task()          Developer      Mark the beginning and end of a logical task. Often
end_task()                           used to divide the executions of threads in a pool
                                     into separate tasks (§6).
lock_wrapper(l)       Developer      Synchronization wrappers. The memoizer intercepts
unlock_wrapper(l)     or TERN        these operations for memoizing schedules, and the
                                     replayer intercepts them for reusing schedules.
before_blocking()     TERN           Inserted before and after blocking system calls. The
after_blocking()                     memoizer logs the order of these calls. The replayer
                                     opportunistically enforces the same order of these calls.

Table 2: TERN interface. Some annotations are inserted by developers, and others are inserted by the instrumentor, indicated by
Column Inserted By. Both the memoizer and the replayer use this interface, but they implement this interface differently (§5).

Currently TERN hooks 28 pthread operations (e.g.,
pthread_mutex_lock(), pthread_create(),
and pthread_cond_wait()). It also handles common
atomic operations such as atomic_dec() and
atomic_inc(). It hooks eight blocking system calls
(e.g., sleep(), accept(), recv(), select(),
and read()). These hooks are sufficient to run the
programs evaluated, and we can easily add more.
Developers manually insert annotations in the first two 
categories. They also annotate custom synchronizations 
(e.g., custom spin locks). TERN’s instrumentor automat- 
ically hooks standard synchronization and blocking sys- 
tem calls. These annotations allow TERN’s memoizer 
and replayer to run as “parasitic” user-space schedulers 
that oversee the scheduling decisions of the OS and syn- 
chronization library, requiring no modifications to either. 


5 Schedule Memoization 


This section presents the idea of schedule memoization
in the context of batch programs. We describe
how TERN memoizes schedules (§5.1), tracks input constraints
(§5.2), merges a schedule into the schedule cache
(§5.3), and reuses schedules (§5.4).


5.1 Memoizing Schedules 


To memoize schedules, the memoizer controls and logs 
synchronization operations. By default, it uses a sim- 
ple round-robin (RR) algorithm that forces each thread 
to do synchronizations in turn. One advantage of this al- 
gorithm is that independent sites may memoize the same 
schedules, making program behaviors deterministic and 
stable across sites. 

The memoizer implements this algorithm by implementing
the wrapper functions in Table 2. Figure 6
shows the wrappers to pthread_mutex_lock() and
pthread_mutex_unlock(). The memoizer maintains
a queue of active threads. Only the thread at the
head of the queue “has the turn” (line 4 and 14). Once
the thread is done with the operation, it gives up the turn
by moving itself to the tail of the queue (line 7 and 18).

 1: queue_t activeq, waitq[N];
 2: pthread_mutex_lock_wrapper(pthread_mutex_t *mutex) {
 3: retry:
 4:   while(self()!=activeq.head); // wait for our turn
 5:   if(!pthread_mutex_trylock(mutex)) { // mutex acquired
 6:     append(schedule, self()); // add tid to schedule
 7:     move(self(), activeq.tail); // give turn to next thread
 8:     return;
 9:   }
10:   move(self(), waitq[mutex].tail); // deterministically wait
11:   goto retry; // wait for our turn again
12: }
13: pthread_mutex_unlock_wrapper(pthread_mutex_t *mutex) {
14:   while(self()!=activeq.head); // wait for our turn
15:   pthread_mutex_unlock(mutex); // mutex released
16:   wake_up(waitq[mutex].head); // deterministically wake up
17:   append(schedule, self()); // add tid to schedule
18:   move(self(), activeq.tail); // give turn to next thread
19: }

Figure 6: The memoizer’s round-robin scheduling algorithm.

We explain three subtleties of the code. First, to avoid 
the deadlock scenario when the head of the queue at- 
tempts to grab an unavailable mutex, we call the non- 
blocking lock operation instead of the blocking one (line 
5). If the mutex is not available, the thread gives up its 
turn and waits on a TERN-maintained wait queue (line 
10). TERN uses its own wait queues to avoid nondeterministic
wakeup orders in the pthread library. Second, we
log synchronizations (line 6 and line 17) only when the
thread has the turn, so that the log faithfully reflects the
actual order of synchronizations. Lastly, we maintain our
internal thread IDs to avoid nondeterminism in the OS
thread IDs across runs. Function self() returns this
internal ID for the current thread (line 6 and line 17).
The memoizer allows a thread to break out of the
round-robin when the thread has waited for its turn for
over a second. The rationale is that if a thread has waited
too long, the current schedule will likely perform poorly
in reuse runs. However, such timeouts do not affect
determinism, because the memoizer still logs the order of
the operations that occurred and the replayer simply enforces
the same order. In our experiments, we never observed
such timeouts because most threads synchronize or call
blocking system calls frequently.

Unlike previous DMT systems, TERN has the flexibility
to select scheduling algorithms. In addition to the RR
algorithm, it implements a first-come first-served (FCFS)
algorithm that lets threads run as is. If the memoizer detects
a race using RR, it can restart the run and switch to
FCFS. Implementing FCFS requires only minor modifications
to the algorithm presented in Figure 6. Specifically,
we replace line 4 and line 14 with a lock operation;
line 7, line 10, and line 18 with an unlock operation; and
line 16 with a NOP.

In addition to synchronizations, the memoizer in- 
cludes the hooks around blocking system calls (§4) in 
the schedule it memoizes because blocking system calls 
are natural scheduling points. However, the replayer will 
only opportunistically replay these hooks when reusing a 
schedule because the returns from blocking system calls 
are driven by the program’s environment. 


5.2 Tracking Input Constraints 


Given the symbolic data marked by developers, the mem- 
oizer tracks the constraints on this data by tracking (1) 
what data is derived from the symbolic data and (2) the 
outcomes of the branch statements that observe this sym- 
bolic and derived data. At the end of this memoiza- 
tion run, the set of branch outcomes together describe 
the constraints to place on the symbolic data required to 
reuse the memoized schedule. That is, if an input satis- 
fies these constraints, we can re-run the program in the 
same way as the memoization run. The constraints collected
this way may be over-constraining if developers
annotate too much data as symbolic. We describe a technique
to address this problem in §7.4.

TERN leverages KLEE [17], an open-source symbolic
execution engine, to track input constraints. To adapt
KLEE to TERN, we made two key modifications. First, 
KLEE works only with sequential programs, thus we ex- 
tended it to support threads. Specifically, we modified 
KLEE to spawn a new KLEE instance for each new thread. 
At the end of the run, we unify the constraints collected 
from each thread as the input constraints of the schedule. 
Second, we simplified KLEE to only collect constraints 
without solving them, because unlike KLEE, TERN need 
not explore different execution paths. 


5.3 Merging Schedules into the Schedule Cache 


Once TERN has memoized a schedule S and its constraints
C, TERN stores the tuple into the schedule cache. Although
the schedule cache is conceptually a set of (C, S)
tuples, its actual structure is a decision tree because a
program may incrementally read inputs from its environment,
calling symbolic() multiple times. For example,
the code in Figure 3 calls symbolic() twice.

Figure 7: Decision tree of TERN’s schedule cache.

Figure 7 illustrates how TERN constructs the decision
tree of the schedule cache. Given a (C, S) tuple,
TERN breaks it down to sub-tuples (C_i, S_i) separated by
symbolic() calls, where S_i contains the synchronization
operations logged and C_i contains the constraints
collected between the i-th and (i+1)-th symbolic()
calls. It then merges the sub-tuples into the i-th level of
the decision tree.

TERN avoids merging redundant tuples into the cache. 
That is, if the cache contains a tuple with less restrictive 
constraints than the tuple being merged, TERN simply
discards the new tuple. Note that the tuples may overlap 
(i.e., one input satisfies more than one set of constraints), 
and TERN simply returns the first match if there are mul- 
tiple matches. 

To speed up cache lookup, TERN sorts all (C_i, S_i) tuples
within the same decision node based on their reuse
rates, defined as the number of successful reuses of S_i
over the number of inputs that have satisfied C_i. Reusing
a schedule may fail even if the input satisfies the schedule’s
input constraints (cf. next subsection). However,
by sorting the tuples based on reuse rates, we automati- 
cally prefer good schedules over bad ones that have many 
failed reuse attempts. To bound the size of the sched- 
ule cache, TERN can throw away bad schedules based on 
reuse rates. However, we have not found the need to do 
so because the schedule cache is often small. 
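
The ordering by reuse rate can be sketched as follows. The struct cache_entry layout and the function sort_node() are hypothetical, the schedule and its constraints are reduced to an integer ID, and an entry whose constraints have never been satisfied is assumed to have a rate of zero.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-tuple bookkeeping inside one decision node. */
struct cache_entry {
    int schedule_id;     /* stand-in for the memoized (C_i, S_i) tuple */
    unsigned reuses;     /* successful reuses of S_i */
    unsigned matches;    /* inputs that have satisfied C_i */
};

/* Reuse rate as defined in the text; 0 if C_i was never satisfied. */
static double reuse_rate(const struct cache_entry *e)
{
    return e->matches ? (double)e->reuses / e->matches : 0.0;
}

/* qsort comparator: higher reuse rate sorts first. */
static int by_reuse_rate_desc(const void *a, const void *b)
{
    double ra = reuse_rate(a), rb = reuse_rate(b);
    return (ra < rb) - (ra > rb);
}

/* Sort the tuples of one decision node so that a first-match lookup
 * tries the schedules with the best track record first. */
void sort_node(struct cache_entry *entries, size_t n)
{
    qsort(entries, n, sizeof entries[0], by_reuse_rate_desc);
}
```

Because lookup returns the first match, sorting in this direction means a schedule with many failed reuse attempts is consulted only after better candidates with overlapping constraints.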


5.4 Reusing Schedules 


To reuse a schedule, TERN must check that the input satisfies
the input constraints of the schedule. To do so, it
maintains an iterator to the decision tree of the schedule
cache. The iterator starts from the root. As the program
runs and calls symbolic(), TERN moves the iterator
down the tree. It checks if the data passed into a
symbolic() call satisfies any set of constraints stored
at the corresponding decision tree node and, if so, enforces
the corresponding schedule.

1: pthread_mutex_lock_wrapper(mutex) {
2:   down(sem[self()]); // wait for our turn
3:   pthread_mutex_lock(mutex);
4:   next = shift schedule; // find next thread in schedule
5:   up(sem[next]); // wake up next thread
6: }

Figure 8: Pseudo code of the replayer.


The performance of the replayer is crucial because 
it runs during a program’s normal executions. To effi- 
ciently enforce a synchronization order, the replayer uses 
a technique we call semaphore relay. Specifically, the 
replayer assigns each thread a semaphore. Before doing 
a synchronization operation, a thread has to wait on its 
semaphore for its turn. Once it is done with the oper- 
ation, it passes the turn to the next thread in the sched- 
ule by signaling the semaphore of the next thread. Com- 
pared to an approach using locks or condition variables, 
semaphore relay avoids unnecessary lock contentions. 
Figure 8 illustrates semaphore relay using the replayer’s
pthread_mutex_lock() wrapper.

We note several subtleties of the pseudo code in Fig- 
ure 8. First, we do not use non-blocking lock operations 
(line 3) as in Figure 6 because the memoizer only logs 
successful lock acquisitions. Second, the replayer main- 
tains internal thread IDs the same way as the memoizer 
to avoid mismatches. Lastly, the down() (line 2) is actually
a timed wait (with a default 0.1ms timeout), so that
a thread can break out of a schedule when the dynamic 
load mismatches the schedule’s assumptions. Note that 
these timeouts merely cause delays and do not affect cor- 
rectness. They rarely occurred in our experiments. 
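
A self-contained sketch of semaphore relay using POSIX threads and unnamed POSIX semaphores is given below. It is an illustration under stated assumptions rather than the replayer's code: the function run_relay() and all variable names are hypothetical, the synchronization operation is replaced by recording the thread ID, and a plain sem_wait() stands in for the timed wait described above (sem_timedwait() would be the POSIX call for that).

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

#define MAX_THREADS 8
#define MAX_OPS 64

static sem_t turn[MAX_THREADS];   /* one semaphore per thread */
static const int *schedule;       /* memoized order of internal thread IDs */
static int schedule_len;
static int pos;                   /* next schedule slot; touched only by the turn holder */
static int observed[MAX_OPS];     /* order in which operations actually ran */
static int ops_of[MAX_THREADS];   /* number of operations each thread performs */

/* Stand-in for a wrapped synchronization operation. */
static void sync_op(int self)
{
    sem_wait(&turn[self]);                 /* down(): wait for our turn */
    observed[pos++] = self;                /* the real operation would run here */
    if (pos < schedule_len)
        sem_post(&turn[schedule[pos]]);    /* up(): pass the turn to the next thread */
}

static void *worker(void *arg)
{
    int self = *(const int *)arg;
    for (int k = 0; k < ops_of[self]; k++)
        sync_op(self);
    return NULL;
}

/* Run nthreads threads under sched[0..len-1]; copy the observed order to out. */
int run_relay(const int *sched, int len, int nthreads, int *out)
{
    pthread_t tid[MAX_THREADS];
    int ids[MAX_THREADS];

    if (len <= 0 || len > MAX_OPS || nthreads <= 0 || nthreads > MAX_THREADS)
        return -1;
    for (int t = 0; t < nthreads; t++)
        ops_of[t] = 0;
    for (int i = 0; i < len; i++) {
        if (sched[i] < 0 || sched[i] >= nthreads)
            return -1;
        ops_of[sched[i]]++;
    }
    schedule = sched;
    schedule_len = len;
    pos = 0;
    for (int t = 0; t < nthreads; t++)
        sem_init(&turn[t], 0, 0);
    sem_post(&turn[sched[0]]);             /* the first thread in the schedule starts */
    for (int t = 0; t < nthreads; t++) {
        ids[t] = t;
        pthread_create(&tid[t], NULL, worker, &ids[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
    for (int i = 0; i < len; i++)
        out[i] = observed[i];
    for (int t = 0; t < nthreads; t++)
        sem_destroy(&turn[t]);
    return 0;
}
```

Each thread blocks only on its own semaphore, so handing over the turn wakes exactly one waiter, whereas a single shared lock or condition variable would wake or contend with all of them.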


6 Windowing 


Server programs present two challenges for TERN. First, 
they are more exposed to timing nondeterminism than 
batch programs because their inputs (client requests) ar- 
rive nondeterministically. Second, they often run contin- 
uously, making their schedules too specific to reuse. 
TERN addresses these challenges using a simple idea 
called windowing. Our insight is that server programs 
tend to return to the same quiescent states. Thus, in- 
stead of processing requests as they arrive, TERN breaks 
a continuous request stream down to windows of re- 
quests. Within each window, it admits requests only at 
fixed points in the current schedule. If no requests ar- 
rive at an admission point for a predefined timeout, TERN 
simply proceeds with the partial window. While a win- 
dow is running, TERN buffers newly arrived requests so 
that they do not interfere with the running window. With 
this approach, TERN can memoize and reuse schedules 
across (possibly partial) windows. The cost of window- 
ing is that it may reduce concurrency and degrade server 
throughput and speed. However, our experiments show
that this cost is reasonable and justified by the gain in
determinism and stability.

To buffer requests, TERN needs to know when a
server receives a request and when it is done processing
the request. Inferring these task boundaries based
on thread creation and exit is unreliable because server
programs frequently use thread pools. Thus, TERN currently
lets developers annotate these boundaries using
begin_task() and end_task(). Manually locating
task boundaries is often easy: a request tends to begin
after an accept() of a client connection and end after
the server sends out a reply.
Exposing hidden states. The assumption of windowing 
is that a server program returns to the same state when it 
quiesces. However, in practice, server states evolve over 
time. For instance, when Apache first serves a page, it 
may load the page from disk and cache it in memory. 
When this page is requested again, Apache can serve it 
directly from its cache. 

These state changes may affect schedules. In the ex- 
ample above, Apache will perform different synchro- 
nizations for the two runs. Thus, for TERN to accurately 
select a schedule to reuse, it must know the hidden states 
that affect schedules. Currently TERN lets developers 
annotate such hidden states using symbolic(). Do- 
ing so is often straightforward. For instance, we inserted 
a symbolic() call to mark the return of Apache’s 
cache_find() as symbolic. 

Exposing hidden states may not always be easy.
We thus created a technique to tolerate missed
symbolic() annotations. The basic idea is to store
backup schedules under the same set of input constraints
to tolerate annotation inaccuracy. For instance, had
a symbolic() annotation not been missed, TERN would
have memoized two different constraint-schedule tuples
(C_1, S_1) and (C_2, S_2). However, because of the missed
annotation, TERN missed the corresponding constraints,
wrongly collapsing C_1 and C_2 into the same set C.
Now the two original tuples become (C, S_1) and (C, S_2),
which appear redundant. Instead of discarding one of
these seemingly redundant schedules, TERN will store
both schedules with the same set of constraints. To select
between these schedules, TERN can select the one
with the higher reuse rate, which likely matches the hidden
state of the program.


7 Refinements 


This section describes four refinements we made, one for
determinism (§7.1) and three for speed (§7.2-§7.4).


7.1 Detecting Data Races 


As discussed in §2.2, if a memoized schedule allows data
races, runs reusing this schedule may become nondeterministic.
Thus, for determinism, we would like to detect
races in memoized schedules and discard them from
the schedule cache. A general race detector would flag
too many races for TERN because it detects conventional
races with respect to the original synchronization constraints
of the program, whereas we want to detect races
with respect to the order constraints of a schedule [46]
(call them schedule races). Figure 9 shows a conventional
race, but not a schedule race because the synchronization
order shown “kills” the race.

++x;
lock(l1);
                lock(l2);
                ++x;

Figure 9: A conventional race, not a schedule race.

lock(l1);       lock(l2);
a[i]++;         a[j]--;
unlock(l1);     unlock(l2);

Figure 10: A symbolic race that occurs only when i = j.

Thus, we built a simple race detector to detect schedule
races. It runs with the memoizer and is happens-before
based. It considers one memory access to happen
before another with respect to the synchronization order
the memoizer records. Sometimes a pair of instructions 
may appear to be a race, when in fact their relative order 
does not alter a run. For instance, a write-write race is 
benign if both instructions write the same value. Simi- 
larly, a read-write race is benign if the value written by 
one instruction does not affect the value read by another. 
Our race detector prunes these benign races. 

Our detector also flags symbolic races, the races that
are data-dependent on inputs. Figure 10 shows an example.
Both variables i and j are inputs, and the race occurs
only when i = j. The risk of a symbolic race is that it
may be absent in a memoization run and thus escape detection,
but show up nondeterministically in a reuse run.
To detect symbolic races, our race detector queries the
underlying symbolic execution engine for pointer equality.
For example, to detect the race in Figure 10, it would
query the underlying symbolic execution engine for the
satisfiability of &a[i] = &a[j]. It flags a symbolic race
if this constraint is satisfiable. Once a symbolic race is
flagged, TERN adds additional input constraints to ensure
that the race does not occur in reuse runs. For Figure 10,
we would add &a[i] ≠ &a[j], which simplifies to i ≠ j.

Our race detector can detect all schedule races in a 
memoization run. It can also detect all symbolic races 
if developers correctly annotate all data that affect syn- 
chronization operations and memory locations accessed. 
If this assumption holds and our race detector reports no 
races in a memoization run, TERN ensures that the mem- 
oized schedule can be deterministically reused. 
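
The symbolic-race check for Figure 10 can be sketched as follows. This is a toy illustration, not TERN's implementation: the brute-force satisfiable() function stands in for a query to the symbolic execution engine, and the array bound of 4 and the search domain are hypothetical.

```python
from itertools import product


def satisfiable(constraints, domain):
    """Brute-force stand-in for a solver query over inputs (i, j).

    Returns a satisfying (i, j) pair, or None if none exists in the domain.
    """
    for i, j in product(domain, repeat=2):
        if all(c(i, j) for c in constraints):
            return (i, j)
    return None


# Input constraints from a hypothetical memoization run: both indices in bounds.
input_constraints = [lambda i, j: 0 <= i < 4, lambda i, j: 0 <= j < 4]

# Two threads access a[i] and a[j]; &a[i] = &a[j] reduces to i = j.
def alias(i, j):
    return i == j


witness = satisfiable(input_constraints + [alias], range(8))
symbolic_race = witness is not None
if symbolic_race:
    # Exclude the race from reuse runs by adding the constraint i != j.
    input_constraints.append(lambda i, j: i != j)

still_races = satisfiable(input_constraints + [alias], range(8)) is not None
```

After the extra constraint is added, no input accepted by the schedule can trigger the aliasing access.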


7.2 Skipping Unnecessary Synchronizations 


When reusing a schedule, TERN enforces a total syn- 
chronization order according to the schedule. These 


TERN-enforced execution order constraints are more 
stringent than the constraints enforced by the origi- 
nal synchronizations in the program. Thus, for speed, 
TERN can actually skip these unnecessary synchronizations.
In our current implementation, we skip sleep(), usleep(),
and pthread_barrier_wait() because they are frequently used.
We found that this optimization was quite effective and even made programs
run faster than nondeterministic execution (§8.3). 


7.3 Simplifying Constraints 


To reuse a schedule, TERN must check if the current in- 
put satisfies the constraints of the schedule. The over- 
head of this check depends on the number of constraints, 
yet the set of constraints TERN collects may not always 
be in simplified form. That is, a subset of the con- 
straints may imply the entire set. For example, consider 
a loop “for(int i=0;i!=n;++i)” with a symbolic
bound n. When running this code with n = 10, we will
collect a set of constraints {0 ≠ n, 1 ≠ n, ..., 10 = n},
but the last constraint alone implies the entire set. 


To simplify constraints, TERN uses a greedy algorithm.
Given a set of constraints C, it iterates through each
constraint c and checks if C \ {c} implies c. If
so, it simply discards c. Our observation is that con- 
straints collected later in a run tend to be more compact 
than the earlier ones. Thus, when pruning constraints, we 
start from the ones collected earlier. Although we could 
have used the underlying symbolic execution engine to 
simplify constraints, it lacks this domain knowledge and 
may perform poorly. 
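
The greedy pruning can be sketched in a few lines of Python, applied to the loop example above. This is a minimal sketch under stated assumptions: implication is checked by brute force over a small hypothetical integer domain rather than by a constraint solver, and the (label, predicate) representation is invented for illustration.

```python
def implies(premises, conclusion, domain):
    """True if every value satisfying all premises also satisfies conclusion."""
    return all(conclusion(n) for n in domain if all(p(n) for p in premises))


def simplify(constraints, domain):
    """Greedy pruning in collection order (earliest constraints first).

    Drops each constraint c that the other remaining constraints imply.
    """
    kept = list(constraints)
    for c in constraints:
        rest = [k for k in kept if k is not c]
        if implies([pred for _, pred in rest], c[1], domain):
            kept = rest
    return kept


def loop_constraints(bound):
    """Constraints collected by 'for(int i=0;i!=n;++i)' run with n = bound."""
    cs = [(f"{k} != n", lambda n, k=k: n != k) for k in range(bound)]
    cs.append((f"{bound} == n", lambda n, k=bound: n == k))
    return cs


constraints = loop_constraints(10)
simplified = simplify(constraints, range(64))
labels = [label for label, _ in simplified]
```

Starting from the earliest constraints lets the single, more compact constraint collected last absorb all the inequalities before it.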


7.4 Slicing Out Irrelevant Branches 


A branch statement may observe a piece of symbolic 
data but perform no synchronization operation in either 
branch. The constraints collected from this branch are 
unlikely to affect schedules. If we include irrelevant con- 
straints in the input constraints of a schedule, we not only 
increase constraint checking time, but also preclude legal 
reuses of the schedule. 


To address this problem, TERN employs a simple 
static analysis to automatically prune likely irrelevant 
constraints. At the heart of this technique is a slicing 
analysis that identifies branch statements unlikely to af- 
fect synchronization operations. Specifically, given a 
branch statement s, this analysis computes s_d, the
immediate post-dominator [8] of s, and marks s as irrelevant
if no synchronization operations are between s and
s_d. Although simple, this technique reduced constraint
checking time significantly (§8.3). However, we note 
that our analysis is unsound because it ignores data de- 
pendencies. Thus, we plan to implement a sound slicing 
algorithm [21] in our future work. 
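
The slicing check can be sketched on a small control-flow graph. This is an illustrative sketch only: the CFG, node names, and set of synchronization nodes are hypothetical, the post-dominator sets are computed with a simple iterative fixpoint, and every non-exit node is assumed to have at least one successor. Like the analysis described above, it ignores data dependencies.

```python
def post_dominators(cfg, exit_node):
    """Iteratively compute the post-dominator set of every CFG node."""
    nodes = set(cfg)
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            new = {n} | set.intersection(*[pdom[s] for s in cfg[n]])
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


def immediate_post_dominator(pdom, node):
    """The closest strict post-dominator of node, or None."""
    strict = pdom[node] - {node}
    for d in strict:
        if pdom[d] == strict:
            return d
    return None


def is_irrelevant(cfg, pdom, sync_nodes, branch):
    """True if no sync operation lies between branch and its post-dominator."""
    stop = immediate_post_dominator(pdom, branch)
    seen, stack = set(), list(cfg[branch])
    while stack:
        n = stack.pop()
        if n == stop or n in seen:
            continue
        seen.add(n)
        if n in sync_nodes:
            return False
        stack.extend(cfg[n])
    return True


# Hypothetical CFG: b1 guards plain computation, b2 guards a lock acquisition.
cfg = {
    "entry": ["b1"], "b1": ["t1", "f1"], "t1": ["j1"], "f1": ["j1"],
    "j1": ["b2"], "b2": ["t2", "j2"], "t2": ["j2"], "j2": ["exit"], "exit": [],
}
sync_nodes = {"t2"}
pdom = post_dominators(cfg, "exit")
b1_irrelevant = is_irrelevant(cfg, pdom, sync_nodes, "b1")
b2_irrelevant = is_irrelevant(cfg, pdom, sync_nodes, "b2")
```

Constraints from b1 would be pruned, while those from b2 would be kept because one of its arms performs synchronization.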
Program          Size     Symbolic  Task  Sync  Total
Apache           464K     4         2     0     6 (+1)
MySQL            1,182K   1         2     0     3 (+28)
PBZip2           1,551    3         N/A   0     3
fft              1,403    4         N/A   0     4
lu               1,265    3         N/A   0     3
barnes           1,954    9         N/A   0     9
radix            661      4         N/A   0     4
fmm              3,208    8         N/A   1     9
ocean            6,494    5         N/A   0     5
volrend          18,082   1         N/A   1     2
water-spatial    1,573    9         N/A   0     9
raytrace         5,808    3         N/A   0     3
water-nsquared   1,188    10        N/A   0     10
cholesky         3,683    3         N/A   1     4


Table 3: Statistics of programs evaluated. Size counts the
lines of code for each program. Symbolic counts the symbolic
variables we marked. Task counts the task boundary annotations
(begin_task() and end_task()) we inserted.
Sync counts the annotations for custom synchronizations we
inserted. The numbers in parentheses under Total count
miscellaneous changes.


8 Evaluation 


Our TERN implementation consists of 8,934 lines of C++ 
code, including 827 lines for the instrumentor imple- 
mented as an LLVM pass; 5,451 lines for the proxy, 
schedule cache, memoizer, and replayer; and 2,656 lines 
for modifications to KLEE. 

We evaluated TERN on a diverse set of 14 programs,
ranging from two server programs, Apache and MySQL,
to one parallel compression utility, PBZip2, to 11 scientific
programs in SPLASH2.²

Our main evaluation machine is a 2.66 GHz quad-core 
Intel machine with 4 GB memory running Linux 2.6.24. 
When evaluating TERN on server programs, we ran the 
server on this machine and the client on another to avoid 
unnecessary contention. These machines are connected 
via 1 Gbps LAN. We compiled all programs down to machine
code using llvm-gcc -O2 and LLVM’s bitcode
compiler llc.

We focused our evaluation on four key questions: 

1. Is TERN easy to use (§8.1)?
2. Does TERN make multithreaded programs stable
across different inputs (§8.2)?
3. Does TERN incur high overhead (§8.3)?
4. Does TERN make multithreaded programs deterministic
on the same input (§8.4)?


8.1 Ease of Use 


Table 3 summarizes the modifications we made to make 
the programs work with TERN. For each program but 
MySQL, we modified only 3-10 lines. For Apache, we 
marked the HTTP command, URL, HTTP version, and 




²The version of SPLASH2 [36] we acquired has 12 programs,
one of which does not compile on our evaluation machine.
Table 4: Bug stability results on SPLASH2 fft. The
leftmost column and the bottommost row show the command
line arguments. Option -p specifies the number of
threads, and -m the amount of computation (matrix size).
Symbol % indicates that the bug occurred, and W that the bug
never occurred.


the return of cache_find() as symbolic (§6). For
MySQL, we marked the SQL query. For PBZip2, we
marked the number of threads and file blocks. (The number
of file blocks is set in two places, contributing two
symbolic annotations.) For all the scientific programs,
we marked all input arguments as symbolic except those
configuring output verbosity.³ We marked three custom
synchronization operations in three SPLASH2 programs.
We made two miscellaneous changes to Apache
and MySQL. The line counts are shown in parentheses
under the Total column. For Apache, we had to fix an
uninitialized memory read in ap_signal_server()
to make it work with KLEE. For MySQL, we wrote a 28-line
function to mark the numbers in each SQL query as
concrete (i.e., not affecting schedules) to avoid making
the input constraints too specific.


8.2 Stability


We evaluated TERN’s stability via two sets of experiments.
The first set compares it to an existing DMT system
(§8.2.1); the second quantifies how frequently it can
reuse schedules on real and synthetic workloads (§8.2.2).


8.2.1 Bug Stability 


We compared TERN to COREDET [13] in terms of bug 
stability: does a bug occur in one run but disappear in an- 
other when the input varies slightly? We ran three buggy 
SPLASH2 programs, fft, lu, and barnes, in three modes: 
nondeterministic execution (Nondet), with COREDET, 
and with TERN. We varied their inputs by varying the 
number of threads and the amount of computation. For 
each program, execution mode, and input combination, 
we ran the program 100 times, and recorded whether the 
corresponding bug occurred. 

We present only the fft results; the results of the other
programs are similar. Table 4 shows the buggy behaviors
of fft. In nondeterministic mode, the bug never occurred,
even though each run almost always yielded a new
synchronization order. With COREDET, slight changes


³Note that we could have used a two-line loop to mark these arguments
as symbolic. Instead, we report the total number of symbolic
variables to avoid masking real data.
Program-Workload   Reuse Rates (%)   Schedules
Apache-CS          90.3%             100
SysBench-simple    94.0%             50
SysBench-tx        44.2%             109
PBZip2-usr         96.2%             90


Table 5: TERN stability. Column Schedules indicates the 
number of schedules in the schedule cache. 


in computation made the bug occur or disappear. With 
TERN, the bug never occurred, and TERN reused only 
three schedules for all runs, one for each thread count. 


8.2.2 Reuse Rates 


We also quantified how frequently TERN could reuse 
schedules. Specifically, we measured the overall reuse 
rate, defined as the number of inputs processed using 
memoized schedules over the total number of inputs. The 
higher the reuse rates, the more stable the programs be- 
come. TERN had nearly 100% overall reuse rates for the 
scientific programs after a small number of memoization 
runs. Thus, we focused on Apache, MySQL, and PBZip2 
in our experiments.

We used four workloads to evaluate overall reuse rates: 

Apache-CS: a real 4-day trace from the Columbia CS 
website with 122,000 HTTP requests. We wrote a 
script to replay this trace at a rate of 100 concurrent 
requests per second. 

SysBench-simple: SysBench [7] in simple mode. This 
synthetic workload consists of random select queries. 

SysBench-tx: SysBench in transaction mode. This syn- 
thetic workload consists of random select, update, 
delete, and insert queries. 

PBZip2-usr: a random selection of 10,000 files from 
/usr on our evaluation machine.

For each workload, we first randomly selected 1%-3% 
of the workload and ran the memoizer to populate the 
schedule cache. We then ran the entire workload with 
the replayer and measured the overall reuse rates. We 
ran eight worker threads for each program because they 
performed best (with or without TERN) with this setting. 
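
The overall reuse rate defined above amounts to a simple ratio. The following sketch is purely illustrative: the schedule cache is modeled as a list of input-constraint predicates, and the request fields and thresholds are hypothetical, not taken from TERN or from the workloads above.

```python
def reuse_rate(workload, schedule_cache):
    """Inputs served by some memoized schedule, over the total number of inputs."""
    hits = sum(1 for x in workload if any(ok(x) for ok in schedule_cache))
    return hits / len(workload)


# Hypothetical cache: one input-constraint predicate per memoized schedule.
schedule_cache = [
    lambda req: req["method"] == "GET" and req["size"] <= 4096,
    lambda req: req["method"] == "GET" and req["size"] > 4096,
]

# Hypothetical workload of four requests; the POST matches no schedule.
workload = [
    {"method": "GET", "size": 100},
    {"method": "GET", "size": 9000},
    {"method": "POST", "size": 10},
    {"method": "GET", "size": 4096},
]

rate = reuse_rate(workload, schedule_cache)
```

An input that matches no cached schedule would fall back to memoization, which lowers the rate.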

Table 5 shows the results. For three out of the four
workloads, TERN could reuse a small number of schedules
to process over 90% of the inputs. For SysBench-tx,
TERN had a lower overall reuse rate, for two reasons.
First, this workload makes it unlikely to
reuse schedules because it mixes many randomly generated
queries with different types and parameters. Second,
we annotated only the SQL command as symbolic without
exposing the hidden states of MySQL (§6) so that
we could measure TERN’s performance in an adversarial
setting. Nonetheless, TERN managed to process 44.2%
of inputs with a small number of schedules.
Figure 11: Relative overhead of the replayer over nondeterministic
execution. Negative overhead means speedup.


8.3 Overhead


We used the following workloads to evaluate TERN’s 
overhead. For Apache, we used ApacheBench [1] to re- 
peatedly download a 50KB webpage. For MySQL, we 
used the SysBench-simple workload from the previous 
subsection. Both ApacheBench and SysBench are used 
by the server developers themselves. We made these 
benchmarks CPU bound by fitting the web or database 
in memory and by connecting the server and client via a 
1 Gbps LAN. For PBZip2, we decompressed a 10 MB 
file. For SPLASH2 programs, we ran them typically for 
10-100 ms. We measured the execution time for batch 
programs and the throughput (TPUT) and response time 
(RESP) for server programs. All numbers reported in 
this section were averaged over 50 runs. 

The most performance-critical component is the re- 
player because it operates during the normal execu- 
tion of a program. Figure 11 shows the relative over- 
head of the replayer over nondeterministic execution, 
the smaller the better. For seven out of the fourteen 
programs, the replayer performed almost identically to 
nondeterministic execution. For PBZip2 and barnes, 
TERN performed better. This speedup came partially 
from the optimization to remove unnecessary synchro- 
nizations, discussed in the next paragraph. TERN’s 
overhead for MySQL, volrend, raytrace, water-nsquared,
and cholesky is relatively large because these programs
performed many synchronization operations over
a short period of time. For instance, water-nsquared
and cholesky both call pthread_mutex_lock() and
pthread_mutex_unlock() in a tight loop.

We also measured the effects of skipping unnecessary
synchronizations (§7.2). Figure 12 shows the results.
This optimization significantly reduced the replayer’s
overhead for four programs. Specifically, it
Figure 12: Overhead reduction by skipping unnecessary synchronizations.
“no opt” indicates the baseline overhead.

Figure 13: Optimizations to speed up constraint checking.
Note the y-axis is broken. “no opt” indicates the baseline constraint
checking time. “simplify” refers to the optimization in
§7.3. “slice” refers to the optimization in §7.4.


made PBZip2 and barnes run faster than nondetermin- 
istic execution, and reduced the overhead of water- 
nsquared from 172.4% to 39.1%. Its effects on the other 
programs are negligible and thus not shown. 

To reuse a schedule on an input, TERN must check the 
input against memoized constraints. Constraint check- 
ing can be costly, and TERN provides two optimizations 
to speed it up (§7.3 and §7.4). Figure 13 shows these optimizations
can effectively speed up constraint checking
for Apache, fft, lu, and radix. In particular, they reduced 
the constraint checking time for lu by 16x. 

Compared to the replayer, the memoizer can run offline,
so its performance is not as critical. Table 6
shows that the memoizer’s slowdown can sometimes exceed 200x.
The main reason is that KLEE, the symbolic engine used,
interprets programs instead of running them natively. An
Program        Nondet          Memoization    Overhead (times)
Apache-TPUT    462.2 req/s     2.1 req/s      219.1
Apache-RESP    0.22 s          3.96 s         17.0
MySQL-TPUT     13779.3 req/s   172.2 req/s    79.0
MySQL-RESP     0.6 ms          61 ms          100.6
PBZip2         0.18 s          15.19 s        83.4


Table 6: Overhead of the memoizer. 





Program   Error Description
Apache    Reference count decrement and check against 0 are not atomic.
PBZip2    Variable fifo is used in one thread after being freed by another.
fft       initdonetime and finishtime are read before assigned the correct values.
lu        Variable rf is read before assigned the correct value.
barnes    Variable tracktime is read before assigned the correct value.


Table 7: Concurrency errors used in evaluation. 


instrumentation-based approach can greatly reduce this 
slowdown [16], which we plan to implement in our fu- 
ture work. 


8.4 Determinism 


We evaluated TERN’s determinism via three sets of experiments.
The first set checked the memoized schedules
for races (§8.4.1). The second evaluated TERN’s ability
to deterministically reproduce or avoid bugs (§8.4.2).
The third measured how deterministic memory accesses
are with and without TERN (§8.4.3).


8.4.1 Race Detection Results 


When memoizing schedules for each of the 14 programs, 
we turned on TERN’s race detector. We found that except 
for radix and cholesky, the schedules TERN memoized 
for all other programs were free of schedule races and 
symbolic races with respect to the symbolic data we an- 
notated (§7.1). Our race detection result is not surprising 
because most schedules are indeed race free. It implies 
that, for runs that reuse the memoized schedules of all 
programs but radix and cholesky, TERN ensures determinism,
barring the assumption discussed in §7.1.


8.4.2 Bug Determinism 


We also evaluated how deterministically TERN could re- 
produce or avoid bugs. Table 7 lists five real concur- 
rency bugs we used. We selected them because they were 
frequently used in previous studies [37, 39, 43, 44] and 
we could reproduce them on our evaluation machine. To 
measure bug determinism, we first memoized schedules 
for programs listed in Table 7. We then manually inserted
usleep() into these programs to get alternate schedules.
Program   Length    Nondet   TERN     Ratio
Apache    148,058   86,215   10,821   7.97
PBZip2    1,234     161      69       2.33


Table 8: Memory access determinism. We traced memory ac- 
cessed only from PBZip2, not the external BZip2 library. 


We then ran the buggy programs again, reusing the memoized
schedules. We also injected random delays into the
reuse runs to perturb timing. We found that TERN consistently
reproduced or avoided all five bugs. We verified
this result by inspecting the memoized schedules.


8.4.3 Memory Access Determinism


TERN enforces synchronization orders, which should 
make memory access orders more deterministic. We 
quantified this effect over Apache and PBZip2. Specif- 
ically, we instrumented Apache with LLVM to trace ac- 
cesses to global variables and the heap, a crude approxi- 
mation of shared memory. We ran Apache with TERN to 
serve five HTTP requests and collected a trace of mem- 
ory accesses. We then repeated this experiment 20 times 
to collect 20 traces, and computed the average pairwise 
edit distance [52]. We then measured the same edit dis- 
tance for Apache in nondeterministic execution mode 
and compared the two. We did the same comparison 
for PBZip2 with a decompression workload of 2MB. Ta- 
ble 8 shows the result. For Apache, runs with TERN were 
7.97 times more deterministic than those without. For 
PBZip2, TERN was 2.33 times more deterministic, but 
the memory trace had only 1,234 accesses on average. 
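
The determinism metric used here, average pairwise edit distance over a set of traces, can be sketched as follows. This is an illustrative sketch using the standard Levenshtein distance; the toy traces are hypothetical, and the final two lines simply recompute the Ratio column of Table 8 from its Nondet and TERN columns.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def average_pairwise_distance(traces):
    """Mean edit distance over all unordered pairs of traces."""
    pairs = list(combinations(traces, 2))
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical traces: each letter stands for one recorded memory access.
traces = ["abcd", "abdc", "abcd"]
avg = average_pairwise_distance(traces)

# Ratio column of Table 8: Nondet distance divided by TERN distance.
apache_ratio = round(86215 / 10821, 2)
pbzip2_ratio = round(161 / 69, 2)
```

A smaller average distance means the runs' access orders differ less, so a larger Nondet-to-TERN ratio indicates a larger gain in determinism.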


9 Related Work 


Deterministic Execution TERN differs from existing 
DMT systems [13, 22, 41] by making threads stable, i.e., 
repeating familiar behaviors across different inputs. An- 
other difference is that TERN reduces timing nondeter- 
minism for server programs through windowing. 

The closest system to TERN in this category is 
Kendo [41], a software-only DMT system that also en- 
forces synchronization orders instead of memory ac- 
cess orders for efficiency. COREDET [13] is another 
software-only DMT system that enforces deterministic 
memory access orders. Both systems are based on log- 
ical clocks and have been shown to work on scien- 
tific benchmarks, such as SPLASH2. The authors of 
COREDET have noted that a small modification to the 
original program leads to a much different COREDET- 
instrumented program, which the idea of schedule mem- 
oization may address. COREDET is a software imple- 
mentation (with extensions) of DMP [22], a hardware 
DMT system.

Grace [14] proposes a novel approach to making C and 
C++ programs with fork-join parallelism behave like se- 


quential programs. It runs each thread within a process 
and commits memory writes atomically and determin- 
istically. It detects memory access conflicts efficiently 
using hardware page protection. Grace has been shown 
to perform and scale well on Phoenix benchmarks [45] 
and a Cilk [15] benchmark. Unlike Grace, TERN aims to 
make general multithreaded programs, not just fork-join 
programs, deterministic and stable. 

Deterministic Replay Deterministic replay [9, 23, 24, 
27, 31, 33, 34, 40, 44, 50, 51] aims to replay the exact 
recorded executions, whereas TERN “replays” memoized 
schedules on different inputs. Some recent deterministic 
replay systems include Scribe, which tracks page owner- 
ship to enforce deterministic memory access [34]; Capo, 
which defines a novel software-hardware interface and 
a set of abstractions for efficient replay [40]; PRES and 
ODR, which systematically search for a complete exe- 
cution based on a partial one [9, 44]; and SMP-ReVirt, 
which uses a clever page protection trick for recording the
order of conflicting memory accesses [24].
Concurrency Errors The complexity in developing 
multithreaded programs has led to many concurrency er- 
rors [39]. A significant number of them are not data 
races, but atomicity and order errors [39], which can be 
deterministically reproduced or avoided using only syn- 
chronization orders. 

Much work exists on concurrency error detection [25, 
37, 38, 47, 55, 56], diagnosis [42, 43, 48], and correc- 
tion [32, 53]. TERN aims to make the executions of 
multithreaded programs deterministic and stable, and is 
complementary to existing work on concurrency errors. 
Specifically, TERN can use existing work to detect and 
fix the errors in the schedules it selects. Moreover, even 
for programs free of concurrency errors, TERN still pro- 
vides value by making their behaviors repeatable. 
Symbolic Execution The combination of symbolic and 
concrete executions has been a hot research topic. Researchers
have built scalable and effective symbolic execution
systems to detect errors [16-18, 20, 28-30, 49,
54], block malicious inputs [21], and preserve privacy in 
error reports [19]. Compared to these systems, TERN ap- 
plies symbolic execution to a new domain: tracking input 
constraints to reuse schedules. 


10 Conclusion 


We have presented TERN, the first DMT system that 
makes general multithreaded programs stable by repeat- 
ing the same schedules on different inputs. TERN does 
so using schedule memoization: if a schedule is shown to 
work on an input, TERN memoizes the schedule; if a sim- 
ilar input arrives later, TERN simply reuses the memo- 
ized schedule. TERN is also the first DMT system to mit- 
igate input timing nondeterminism for server programs. 
Our TERN implementation runs on Linux. It requires 
no new hardware, no modifications to the underlying OS
or synchronization library, and only a few lines of mod- 
ifications to the multithreaded programs. We evaluated 
TERN on a diverse set of real programs, including two 
server programs, one desktop program, and 11 scien- 
tific programs. Our results show that TERN is easy to 
use, makes programs more deterministic and stable, and 
has reasonable overhead. TERN is the first DMT sys- 
tem shown to work on applications as large, complex, 
and nondeterministic as MySQL and Apache. It demon- 
strates that DMT has the potential to be deployed today. 
Abstract


The Internet has allowed collaboration on an unprece- 
dented scale. Wikipedia, Luis Von Ahn’s ESP game, and 
reCAPTCHA have proven that tasks typically performed 
by expensive in-house or outsourced teams can instead be 
delegated to the mass of Internet computer users. These 
success stories show the opportunity for crowd-sourcing 
other tasks, such as allowing computer users to help each 
other answer questions like “How do I make my com- 
puter do X?”. The current approach to crowd-sourcing IT 
tasks, however, limits users to text descriptions of task solutions,
which is both ineffective and frustrating. We propose
instead to allow the mass of Internet users to help
each other answer how-to computer questions by actually 
performing the task rather than documenting its solution. 

This paper presents KarDo, a system that takes as input 
traces of low-level user actions that perform a task on in- 
dividual computers, and produces an automated solution 
to the task that works on a wide variety of computer con- 
figurations. Our core contributions are machine learning 
and static analysis algorithms that infer state and action 
dependencies without requiring any modifications to the 
operating system or applications. 


1 Introduction 


Computer systems are becoming increasingly complex. 
As a result, users regularly encounter tasks that they do 
not know how to perform such as configuring their home 
router, removing a virus, or creating an email account. 
Many users do not have technical support, and hence 
their first, and often only, resort is a web search. Such 
searches, however, often lead to a disparate set of user fo- 
rums written in ambiguous language. They rarely make 
clear which user configurations are covered by a par- 
ticular solution; descriptions of different problems over- 
lap; and many documents contain conjectured solutions 
that may not work. The net result is that users spend 
hours manually working through large collections of documents
to try solutions that often fail to help them perform
their task.

What a typical user really wants is a system that auto- 
matically performs the task for him, taking into account 
his machine configuration and global preferences, and 
asking the user only for information that cannot be au- 
tomatically pulled from his computer. Today, however, 
automation requires experts to program scripts. This pro- 
cess is slow and expensive and hence unlikely to scale to 
the majority of tasks that users perform. For instance, 
a recent automation project at Microsoft succeeded in 
scripting only about 150 of the hundreds of thousands 
of knowledge-base articles in a period of 6 months [10]. 

This paper introduces KarDo, a system that enables 
the mass of Internet users to automate computer tasks. 
KarDo aims to build a database of automated solutions 
for computer tasks. The key characteristic of KarDo is 
that a user contributes to this database simply by perform- 
ing the task. For lay users this means interacting with 
the graphical user interface, which manifests itself as a 
stream of windowing events (i.e., keypresses and mouse 
clicks). KarDo records the windowing events as the user 
performs the task. It then merges multiple such traces 
to produce a canonical solution for the task which en- 
codes the various steps necessary to perform the task on 
different configurations and for different users. A user 
who comes across a task he does not know how to per- 
form searches the KarDo database for a matching solu- 
tion. The user can either use the solution as a tutorial that 
walks him through how to perform the task step by step, 
or ask KarDo to automatically perform the task for him. 

The key challenge in automating computer tasks based 
on windowing events is that events recorded on one ma- 
chine may not work on another machine with a differ- 
ent configuration. To address this problem, a system 
needs to understand the dependencies between the sys- 
tem state and the windowing events. While the system 
could track these dependencies explicitly by modifying 
Figure 1: Illustration of KarDo’s three-stage design.


the OS and potentially applications [18], such an ap- 
proach presents a high deployment barrier and is hard to 
use for tasks that involve multiple machines (e.g., config- 
uring a wireless router). KarDo therefore adopts an ap- 
proach that implicitly infers system state dependencies, 
and does not require modifying the OS or applications. 
In particular, KarDo builds a model that maps window- 
ing events to abstract actions that capture impact on sys- 
tem state: UPDATE and COMMIT actions, which actu- 
ally modify system state, and NAVIGATE actions, which 
simply open or close windows but do not modify system 
state. KarDo performs this mapping automatically us- 
ing machine learning. It then runs a set of static analysis 
algorithms on these sequences of abstract actions to pro- 
duce a canonical solution which can perform the task on 
various different configurations. The system operates in 
3 stages, described below and shown in Fig. 1. 


(a) Abstraction. KarDo first captures the context around 
each windowing event (e.g., the associated application,
window, widget, etc.) using the accessibility interface,
which was originally developed for visually impaired 
users and is supported by modern operating systems [8, 
5]. KarDo then extracts from the context a per event fea- 
ture vector, which it uses in a machine learning algorithm 
to map the event to the corresponding abstract action. 
Fig. 1(a) illustrates this operation. 


(b) Generalization. KarDo then performs static analysis 
on the abstract actions in each recorded trace to elimi- 
nate irrelevant actions that do not affect the final system 
state. Once it has the relevant actions for each task, it 
proceeds to generalize them to deal with diverse config- 
urations. Since navigation actions do not update state, 
KarDo can learn the many diverse ways to navigate the 
GUI from totally unrelated tasks, and therefore builds a 
global navigation graph across all tasks. In contrast, for 
state-modifying actions (i.e., UPDATES and COMMITS), 
KarDo uses differences across recordings of the same 
task to learn the different sequences of state-modifying 
actions that perform the task on various configurations, 
and represents this knowledge as a per task directed graph 
parameterized by configuration. Fig. 1(b) illustrates the 
generalization stage. 


(c) Replay. In order to perform the task in a specific 
environment, KarDo walks down the graph of state- 
modifying actions trying to find a branch where all the 
actions involve applications (e.g., Thunderbird, Firefox)
that exist on the machine. Once it finds such a
branch, it proceeds to execute the actions along it. It 
moves from one state modifying action to the next by 
leveraging the global navigation graph to find a path from 
one of the active desktop widgets to the widget corre- 
sponding to the next state-modifying action. Fig. 1(c) 
illustrates the replay stage. 
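
The replay stage can be sketched as two small searches. This is a hypothetical illustration, not KarDo's implementation: the per-task graph is flattened into a list of candidate branches, the navigation graph is a plain adjacency map, breadth-first search is one possible way to find a path, and all application and widget names are invented for the example.

```python
from collections import deque


def pick_branch(branches, installed):
    """Return the first branch whose actions all use installed applications."""
    for branch in branches:
        if all(action["app"] in installed for action in branch):
            return branch
    return None


def navigation_path(nav_graph, active_widgets, target):
    """Breadth-first search from any active widget to the target widget."""
    queue = deque((w, [w]) for w in active_widgets)
    seen = set(active_widgets)
    while queue:
        widget, path = queue.popleft()
        if widget == target:
            return path
        for nxt in nav_graph.get(widget, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None


# Hypothetical per-task branches of state-modifying actions.
branches = [
    [{"app": "Thunderbird", "widget": "Account Settings"}],
    [{"app": "Outlook Express", "widget": "Accounts"}],
]
branch = pick_branch(branches, installed={"Outlook Express", "Firefox"})

# Hypothetical global navigation graph learned from unrelated tasks.
nav_graph = {
    "Desktop": ["Start Menu"],
    "Start Menu": ["Outlook Express"],
    "Outlook Express": ["Tools"],
    "Tools": ["Accounts"],
}
path = navigation_path(nav_graph, ["Desktop"], branch[0]["widget"])
```

Because navigation does not modify state, the same graph can serve every task, while branch selection stays specific to the task being replayed.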


We built a prototype of KarDo as a thin client connected
to a cloud-based service. We evaluate KarDo on
57 computer tasks drawn from the Microsoft Help [9]
and eHow [4] websites, which together comprise
more than 1000 actions and include tasks like configuring
a firewall, web proxy, and email. We generate a pool of
20 diversely configured virtual machines which we sep- 
arate into 10 training VMs and 10 test VMs. For each 
task, two users performed the task on two randomly cho- 
sen VMs from the training set. We then attempt to per- 
form the task on the 10 test VMs. Our results show that a 
baseline that tries both user traces on each test VM, and 
picks whichever works better, succeeds in only 18% of 
the cases. In contrast, KarDo succeeds on 84% of the 
500+ VM-task pairs. Thus, KarDo can automate com- 
puter tasks across a wide variety of configurations with- 
out modifying the OS or applications. 

We also performed a user study on 5 different com- 
puter tasks, to evaluate how well KarDo performs com- 
pared to humans for the same set of tasks. Even with 
detailed instructions from our lab website the students 
failed to correctly complete the task in 20% of the cases. 
In contrast, when given traces from all 12 users, KarDo 
produced a correct canonical solution which played back 
successfully on a variety of different machines. 


2 Challenges 


A system that aims to automate computer tasks based on 
user executions and without instrumenting the OS or ap- 
plications, needs to attend to multiple subtleties. 


(a) Generalizing Navigation. Consider the task of con- 
figuring a machine for access through remote desktop. 
On Microsoft Windows, the first step is to enable remote 
desktop on the local machine through the “System” dia- 
log box which is accessed through the Control Panel. Au- 
tomatically navigating to this dialog box can be difficult 
however because the Control Panel can be configured in 
three different ways. Novice users typically retain the de- 
(b) Category View
Figure 2: Diverse Configurations. To enable remote desktop one 
must go to the “System” dialog box. Depending on the configuration 
of the Control Panel, one can either directly click the “System” icon (a) 
or must first navigate to “Performance and Maintenance” (b) then click 
the “System” icon. 


fault view which uses a category based naming scheme, 
as in Fig. 2(a). Most advanced users however switch to 
the “Classic View” which always shows all available con- 
trols, as in Fig. 2(b). And, efficiency oriented users often 
go as far as configuring the control panel so it appears as 
an additional menu off of the start menu. All three paths 
however lead to the same “System” dialog box where one 
can turn on remote desktop. The challenge is to produce 
a canonical GUI solution that performs the task on ma- 
chines with any of these configurations even when the 
recorded traces for this task show only one of the possi- 
ble configurations. 


(b) Filtering Mistakes and Irrelevant Actions. KarDo 
needs a mechanism to detect mistakes and eliminate ir- 
relevant actions that are not necessary for the task. For 
example, while performing a task, the user may acciden- 
tally open some program that turns out to not be relevant 
for the task. If this mistake is included in the final so- 
lution, however, it will require the playback machine to 
have this irrelevant program installed in order for KarDo 
to automatically perform the task. It is important to re- 
move mistakes like this to prevent the need for the user 
to rerecord a second “clean” trace, thus allowing users to 
generate usable recordings as part of their everyday work. 


(c) Parameterizing Replay. After enabling remote desk- 
top on his local machine, the user needs to configure the 
router to allow through the incoming remote desktop con- 
nections and direct them to the right machine. KarDo can 
easily automate a task like this, since it is done through 
a web-browser interface to the router, which provides the 
same accessibility information as all other GUI applica- 
tions. The challenge arises, however, because one user 
may have a static IP address while another has a dynamic
IP address, or worse, one user might have a DLink 
router, while another has a Netgear. Different steps are 
required to perform this task if the user has a static IP 
address vs. a dynamic IP address. Similarly, different 
routers present different web-based configuration inter- 
faces, so users with different routers need to perform dif- 
ferent GUI actions to perform this task. KarDo needs 
to retain each of these paths in the final canonical so- 
lution, and parametrize them such that the appropriate 
path can be chosen during playback. The challenge is 
to distinguish these configuration based differences from 
mistakes and irrelevant actions so that the former can be 
retained while the latter are removed. 


(d) User-Specific Entries. Some tasks require a user to 
enter his name, password, or other user-specific entries. 
KarDo can easily avoid recording passwords by recog- 
nizing that the GUI naturally obfuscates them, provid- 
ing a simple heuristic to identify them. However, KarDo 
also needs to recognize all other entries that are user spe- 
cific and distinguish them from entries that differ across 
traces because they are mistakes or configuration-based 
differences. It is critical to distinguish user specific en- 
tries from mistakes and configuration differences because 
KarDo should ask the user to input something like his 
username, while it should automatically discover which 
path to follow for different router manufacturers. 


3 KarDo Overview 


KarDo is a system that enables end users to automate 
computer tasks without programming, and does not re- 
quire modifications to the OS or applications. It has two 
components, a client that runs on the user machine to 
do recording and playback, and a server that contains a 
database of solutions. 

When a user performs a task that he thinks might be 
useful to others, he asks the KarDo client to record his 
windowing events while he performs the task. If the user 
cannot, or does not want to perform the task on his ma- 
chine, he can perform the task remotely on a virtual ma- 
chine running on the KarDo server, while KarDo records 
his windowing events. In either case, when the user is 
done, the client uploads the resulting windowing event 
trace to the KarDo server. The server asks the user for 
a task name and description. It uses this information to 
search its database for similar tasks and asks the user if 
his task matches any of those. This ensures that all traces 
for the same task are matched together. 

When a user encounters a task he does not know how 
to perform, he searches the KarDo database for a solu- 
tion. KarDo’s search algorithm has access to not only the 
information that a normal text search would have, such 
as the task’s name and description, the steps of the task, 
and the text of each relevant widget, but also system level 
information like which programs are installed, and which 
GUI actions he has taken recently. As a result, we believe 
that task search with KarDo can be much more effective 
than standard text searching is today. However, effective 
search represents a research paper on its own, and so we 
leave the search algorithm details to future work. 

The user can either use the solution as a tutorial that 
will walk him through how to perform the task step by 
step, or allow the solution to automatically perform the 
task for him. It is important to recognize however that 
KarDo’s solutions are intended to be best-effort. Even a 
highly evolved system will not be able to automate correctly
all of the time. Thus, KarDo takes a Microsoft Volume 
Shadow Copy Service snapshot before automatically performing
any task, and immediately rolls back if the user 
does not confirm that the task was successfully performed 
(as discussed in §8, however, we leave the security as- 
pects of this problem to future work). 

The next three sections detail the three steps for trans- 
forming a set of traces recorded on one set of machines 
into a solution which allows automated replay on any 
other machine. §4 covers how to record the windowing 
events and map them to abstract actions that highlight 
how each action affects the system state. §5 then de- 
scribes how to merge together multiple such sequences 
of abstract actions to create a generalized solution for any 
configuration. Finally, §6 discusses how replay utilizes 
the generalized solution and the state of the playback ma- 
chine to determine the exact set of playback steps appro- 
priate for that machine. 


4 Windowing Events to Abstract Actions 


The first phase of generating a canonical solution from 
a set of traces is to transform a windowing event trace 
into a sequence of abstract actions, since the generalization
phase, discussed in §5, works over abstract actions. 
Performing this abstraction requires first converting the 
trace to a sequence of raw GUI actions by associating 
GUI context information with each windowing event, and 
then mapping raw GUI actions to abstract actions using a 
machine learning classifier. 


4.1 Capturing GUI Context 


A low-level windowing event contains only the specific 
key pressed, or the mouse button click along with the 
screen location. Effectively mapping these low-level 
events to abstract actions requires additional information 
about the GUI context in which that event took place such 
as which GUI widget is at the screen location where the 
mouse was clicked. KarDo gathers this information using 
the Microsoft Active Accessibility (MSAA) interface [8]. 
Developed to enable accessibility aids for users with impaired
vision, the accessibility interface has been built 
into all versions of the Windows platform since Windows 
98 and is now widely supported [8]. Apple’s OS X al- 
ready provides a similar accessibility framework [7], and 
the Linux community is working to standardize a single 
accessibility interface as well [5]. The accessibility in- 
terface provides information about all of the visible GUI 
widgets, including their type (button, list box, etc.), their 
text name, and their current value, among other charac- 
teristics. It also provides a naming hierarchy of each wid- 
get which we use to uniquely name the widget. KarDo 
uses this context information to transform each window- 
ing event to a raw GUI action performed on a particular 
widget. An example of such a raw GUI action is a left 
click on the OK button in the Advanced tab in the “In- 
ternet E-mail Setting” window. 


4.2 Abstract Model 


KarDo uses an abstract model for GUI actions. This 
model captures the impact that each action has on the 
underlying system state. We do not claim that our model 
captures all possible applications and tasks; however, it 
does capture common tasks (e.g., installation, configura- 
tion changes, network configurations, e-mail, web tasks) 
performed on typical Windows applications (e.g., MS Of- 
fice, IE, Thunderbird, FireFox) as shown from the 57 
evaluation tasks in Table 3. As discussed in §12, it also 
can be extended if important non-compliant tasks or ap- 
plications arise. 

In the abstract model all actions are performed on wid- 
gets. A widget could be a text box, a button, etc. There 
are three types of abstract actions in KarDo’s model: 


UPDATE Actions: These actions create a pending change 
to the system state. Examples of UPDATE actions include 
editing the state of an existing widget, such as typing into 
a text box or checking a check-box, and adding or remov- 
ing entries in the system state, e.g., an operation which 
adds or removes an item from a list-box. 


COMMIT/ABORT Actions: These actions cause pending 
changes made by UPDATE actions to be written back into 
the system state. An example of a COMMIT action is 
pressing the OK button, which commits all changes to all 
widgets in the corresponding window. An ABORT action 
is the opposite: it aborts any pending state changes in the 
corresponding window, e.g., pressing a Cancel button. 


NAVIGATE Actions: These change the set of currently 
visible widgets. NAVIGATE actions include opening a di- 
alog box, moving from one tab to another, or going to the 
next step of a wizard by pressing the Next button. 

Note that a single raw GUI action may be converted 
into multiple abstract actions. For example, pressing the 
OK button both commits the pending states in the corresponding
window and navigates to a new view. 

Figure 3: A simplified illustration mapping raw GUI action to the 
corresponding abstract actions. [Diagram not reproduced: raw GUI 
actions such as “Click Open Dialog”, “Check Check Box”, and 
“Click OK” are mapped to abstract actions such as Navigate and Abort.] 


Fig. 3 illustrates a simple sequence of raw GUI actions 
and the corresponding abstract actions. Here a user clicks 
to open a dialog box, clicks to check a check box, and 
then clicks OK. He then realizes that he made a mistake 
and opens the dialog again to uncheck the check box. Fi- 
nally, he opens the dialog one last time, rechecks the box, 
but reconsiders his change and hits the Cancel button. 
The corresponding sequence of abstract actions shows 
that the user navigated thrice to the dialog box, updated 
the check box, committed or aborted the UPDATE, and 
navigated again to the main window. However, the ab- 
stract model allows us to reason that the first UPDATE 
and the corresponding NAVIGATE and COMMIT actions 
are overwritten by the later UPDATE and hence are redundant
and can be eliminated. Similarly, since the last 
UPDATE and associated ABORT do not update the state, 
they too can be eliminated. In §5.1, we describe KarDo’s 
static analysis algorithm for filtering out such mistakes. 
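
To make the abstract model concrete, the Fig. 3 scenario can be written down as a sequence of abstract actions. The following is a minimal Python sketch, not KarDo's actual representation: the names `Kind`, `AbstractAction`, and `dialog_visit`, and the window and widget labels, are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, List, Optional


class Kind(Enum):
    UPDATE = "UPDATE"
    COMMIT = "COMMIT"
    ABORT = "ABORT"
    NAVIGATE = "NAVIGATE"


@dataclass(frozen=True)
class AbstractAction:
    kind: Kind
    window: str                   # for NAVIGATE: the window that becomes visible
    widget: Optional[str] = None  # widget acted on (hypothetical labels)
    value: Any = None             # new value, used by UPDATE actions only


def dialog_visit(checked: bool, commit: bool) -> List[AbstractAction]:
    """Abstract actions for one visit to the dialog in the Fig. 3 scenario."""
    closing = Kind.COMMIT if commit else Kind.ABORT
    button = "OK" if commit else "Cancel"
    return [
        AbstractAction(Kind.NAVIGATE, "Dialog", "Open Dialog"),
        AbstractAction(Kind.UPDATE, "Dialog", "Check Box", checked),
        # A single raw click on OK/Cancel maps to two abstract actions.
        AbstractAction(closing, "Dialog", button),
        AbstractAction(Kind.NAVIGATE, "Main", button),
    ]


# Check and OK, uncheck and OK, then re-check and Cancel.
fig3_trace = (
    dialog_visit(True, commit=True)
    + dialog_visit(False, commit=True)
    + dialog_visit(True, commit=False)
)

# Only these actions can affect the final system state.
state_modifying = [a for a in fig3_trace if a.kind is not Kind.NAVIGATE]
```

In this representation the three visits to the dialog are easy to tell apart: only the second UPDATE is both committed and not overwritten, which is the property the mistake filter relies on.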


4.3 Mapping to Abstract Actions 


KarDo has to label the raw GUI actions returned by 
the accessibility interface as UPDATE, COMMIT, and/or 
NAVIGATE. It does not attempt to explicitly classify 
ABORT actions because KarDo’s algorithms implicitly 
treat the lack of a COMMIT action as an ABORT action 
as explained in §5.1. Further, a given action can have 
multiple different abstract action labels, or not have any 
label at all. KarDo performs the labeling as follows. 


Table 1: SVM Classifier Features. This table shows the list of features
used by the SVM classifier to determine which actions are UPDATE
and COMMIT actions. All features are encoded as binary features, 
with multi-element features (such as widget name) encoded as a set of 
binary features with one feature for each possible value. 

Feature group                  Feature
Widget/Window Features         Widget name (typically the text on the widget) 
                               Widget role (i.e., button, edit box, etc.) 
                               Does the widget contain a password? 
                               Is the widget updatable (i.e., check box, etc.)? 
                               Is the widget in a menu with checked menu items? 
                               Does the window contain an updatable widget? 
Response To Action Features    Did the action cause a window to close? 
                               Did the action cause a window to open? 
                               Did the action generate an HTTP POST? 
                               Did the action cause the view to change? 
                               Did the action cause the view state to change? 
Action Features                Action type (right mouse click, keypress, etc.) 
                               Keys pressed (the resulting string) 
                               Does the keypress contain an “enter”? 
                               Does the keypress contain alphanumeric keys? 
                               Is this the last use of the widget? 


To label an action as a NAVIGATE action, KarDo uses 
the simple metric of observing whether new widgets become
available before the next raw action. Specifically, 
KarDo’s recordings contain information about not only 
the current window, but all other windows on the screen. 
Thus, if an action either changes the available set of widgets
in the current window, or opens another window, 
then KarDo labels that action as a NAVIGATE action.¹ 
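
The NAVIGATE labeling rule can be sketched as a comparison of the widgets visible before and after a raw action. This is an illustrative Python sketch under stated assumptions: the function name `is_navigate` and the snapshot format (a dict from window name to a set of widget names) are hypothetical, and the sketch interprets "changes the available set of widgets" as new widgets appearing. It ignores the modal-dialog window-close case.

```python
def is_navigate(widgets_before, widgets_after):
    """Return True if the action made new widgets available.

    Both arguments map a window name to the set of widget names
    visible in that window (an assumed snapshot format).
    """
    for window, after in widgets_after.items():
        before = widgets_before.get(window)
        if before is None:
            return True   # the action opened another window
        if after - before:
            return True   # new widgets appeared in an existing window
    return False


# Clicking a button that opens a dialog: a new window appears.
opened = is_navigate(
    {"Main": {"Open Dialog"}},
    {"Main": {"Open Dialog"}, "Dialog": {"Check Box", "OK", "Cancel"}},
)

# Typing into a text box: the set of visible widgets is unchanged.
typed = is_navigate({"Main": {"Name"}}, {"Main": {"Name"}})
```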
Labeling an action as a COMMIT or UPDATE action is 
not as straightforward. There are cases where this label- 
ing is fairly simple; for example, typing in a text box or 
checking a check box is clearly an UPDATE action. But to 
handle the more complex cases, KarDo approaches this 
problem the same way a user would, by taking advantage 
of the visual features on the screen. For example, a typical
feature of a COMMIT action is that it is associated 
with a user clicking a button whose text comes from a 
small vocabulary of words like {OK, Finish, Yes}. 
KarDo does this labeling using a machine learning 
(ML) classifier. Specifically, an ML classifier for a given 
class takes as input a set of data points, each of which is 
associated with a vector of features and produces as out- 
put a label for each data point indicating whether or not 
it belongs to that class. It does this labeling by learning a 
set of weights which indicate which features, and which 
combinations of features, are likely to produce a posi- 
tive data point, and which are likely to produce a nega- 
tive data point. KarDo uses a supervised classifier, which 
does this learning based on a small set of training data. 
KarDo uses two separate classifiers, one for COMMITS
and one for UPDATES. These classifiers take as 
input a data point for each user action (i.e., each mouse 
click or keypress), and label them as UPDATES and COMMITS
respectively.² Table 1 shows the features used by 
KarDo’s classifiers to determine the labels. Features such 
as widget name and widget role cannot be used directly 
by the classifiers, however, because classifiers only work 
with numerical features. Thus, KarDo handles features 
like these, which are character strings, using the same 
technique as the Natural Language Processing community.
Specifically, it adds a new binary feature for each 
observed string, i.e., is the widget name “OK”, is 
the widget name “Close”, etc. This creates a relatively 
large number of features for each action, which can cause 
a problem called overfitting, where the classifier works 
well only on the training data set and does not generalize
to new data. To handle this large number of features,
KarDo uses a type of classifier called a Support 
Vector Machine (SVM), which is robust to large numbers
of features because it uses a technique called margin 
maximization. KarDo trains the SVM classifier using a 
set of training data from one set of traces, while all testing 
is done using a distinct set of traces. 

¹ KarDo will also label a window close as a NAVIGATE action in 
cases like a modal dialog box, where the user cannot interact with the 
underlying window again until the dialog box is closed. 

² Note that since a given action is fed to both classifiers it can be 
classified as both an UPDATE and a COMMIT to account for actions 
like clicking the “Clear Internet Cache” button which both update the 
state and immediately commit that update. 
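
The encoding of string-valued features in §4.3 (one binary feature per observed string) can be sketched as follows. This is a minimal illustration, not KarDo's code: the helper names `build_vocabulary` and `encode`, and the feature keys `widget_name` and `widget_role`, are assumptions. The resulting 0/1 vectors are what a linear classifier such as an SVM would consume.

```python
def build_vocabulary(actions, string_features):
    """Assign one binary feature index to each observed (feature, string) pair."""
    vocab = {}
    for action in actions:
        for name in string_features:
            vocab.setdefault((name, action[name]), len(vocab))
    return vocab


def encode(action, vocab):
    """Encode one action as a 0/1 vector over the vocabulary."""
    vector = [0] * len(vocab)
    for (name, value), index in vocab.items():
        if action.get(name) == value:
            vector[index] = 1
    return vector


training_actions = [
    {"widget_name": "OK", "widget_role": "button"},
    {"widget_name": "Close", "widget_role": "button"},
]
vocab = build_vocabulary(training_actions, ["widget_name", "widget_role"])

ok_vector = encode(training_actions[0], vocab)
# A string never seen in training sets none of the name features.
unseen_vector = encode({"widget_name": "Apply", "widget_role": "button"}, vocab)
```

Every distinct widget name adds a dimension, which is why the feature count grows quickly and why a classifier that tolerates many features is a natural fit.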


5 Generalization 


Generalization starts with multiple abstract action traces 
which perform the same task on different configurations 
and transforms them into a single canonical solution that 
performs the task on all configurations. KarDo performs 
this step by separating how it handles NAVIGATE actions 
from how it handles state modifying actions, i.e. UP- 
DATES and COMMITS. Specifically, it first prunes out all 
NAVIGATE actions from each trace (and all unlabeled ac- 
tions), leaving only the state modifying actions. It then 
follows a three step process to generate a canonical solu- 
tion: (1) it runs a static analysis algorithm on each pruned 
trace that removes all the mistakes and irrelevant UP- 
DATES; (2) these simplified traces are merged together 
to create a single canonical trace which is parameterized 
by user-specific environment; and (3) the NAVIGATE ac- 
tions from all traces for all tasks are utilized to create a 
global navigation graph which is used to do navigation 
during playback. The rest of this section describes these 
three steps in detail. 


5.1 Filtering Mistakes 


The first step of generalization is to filter out mistakes 
from each trace. To understand the goal of filtering 
out mistakes, consider the example in Fig. 3, where the 
user opens the dialog box multiple times, changing the 
value of a given widget each time. In this example, the 
first check box UPDATE is overwritten by the second, 
while the third is never committed. Thus both of these 
UPDATES are unnecessary, and they should be removed 
along with the opening and closing of the dialog box associated
with them. Their removal is important for two 
reasons. First, if a user chooses to read the text version of 
a solution, or to have KarDo walk him through the task, 
then such mistakes will be confusing to the user. Second,
if not removed, mistakes like this can be mistaken for 
user-specific or environment-specific actions and hence 
limit our ability to generalize. 

Figure 4: A Two-Pass Algorithm to Remove Mistakes. 


The naive approach to identifying mistakes would 
compare multiple GUI traces from users who performed 
the same task, and consider differing actions as mistakes. 
Unfortunately, such an approach will also eliminate nec- 
essary actions which differ due to differences in users’ 
personal information (e.g., printer name) or their work- 
ing environment (e.g., different wireless routers). 


In contrast, the key idea in KarDo is to recognize that 
the difference between unnecessary actions and environ- 
ment specific actions is that unnecessary actions do not 
affect the final system state, and GUIs are merely a way 
of accessing this system state. So KarDo tracks the state 
represented by each widget and keeps only actions that 
affect the final state of the system. It does this using the 
following two-pass static analysis algorithm that resem- 
bles the algorithms used in various log recovery systems 
to determine the final set of committed UPDATES. 


Pass 1: Filtering Out Unnecessary UPDATES: The first 
pass removes all UPDATES on a particular widget except 
the last UPDATE which actually gets committed. Specif- 
ically, consider again our example from Fig. 3 where a 
user opens a given dialog box, and modifies a widget 
three times. We can see that KarDo needs to recognize 
that the second UPDATE overwrote the first UPDATE, ren- 
dering the first unnecessary. However, it cannot blindly 
take the last UPDATE, because the final UPDATE was 
aborted. Thus KarDo needs to keep the final committed 
UPDATE for each widget. It does this by walking back- 
wards through the trace maintaining both a list of out- 
standing COMMITS, and a list of widgets for which it’s 
already seen a committed UPDATE. As it walks back- 
wards, it removes both UPDATES without outstanding 
COMMITS and UPDATES for which it’s already seen a 
committed UPDATE on that same widget. 
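
The backward walk of Pass 1 can be sketched as below. This is an illustrative Python sketch, not KarDo's implementation. It assumes a trace is a list of `(kind, window, widget, value)` tuples, that a COMMIT applies to all widgets in its window, and that the implicit abort (a window closed without a COMMIT) is made explicit as an `"ABORT"` entry; the function name `filter_updates` is hypothetical.

```python
def filter_updates(trace):
    """Pass 1: keep only the last committed UPDATE for each widget."""
    outstanding = set()  # windows with a COMMIT later in the trace
    committed = set()    # (window, widget) pairs whose final UPDATE is kept
    kept = []
    for action in reversed(trace):
        kind, window, widget, _value = action
        if kind == "COMMIT":
            outstanding.add(window)
        elif kind == "ABORT":
            # Updates made before this point were discarded, not committed.
            outstanding.discard(window)
        elif kind == "UPDATE":
            key = (window, widget)
            if window not in outstanding or key in committed:
                continue  # never committed, or overwritten later
            committed.add(key)
        kept.append(action)
    kept.reverse()
    return kept


# State-modifying actions of the Fig. 3 scenario (NAVIGATE already pruned).
fig3 = [
    ("UPDATE", "Dialog", "Check Box", True),
    ("COMMIT", "Dialog", "OK", None),
    ("UPDATE", "Dialog", "Check Box", False),
    ("COMMIT", "Dialog", "OK", None),
    ("UPDATE", "Dialog", "Check Box", True),
    ("ABORT", "Dialog", "Cancel", None),
]
after_pass1 = filter_updates(fig3)
```

On this trace only the second UPDATE survives: the first is overwritten by it and the third is never committed.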


Pass 2: Filtering Out Unnecessary COMMITS: The 
second pass removes COMMITS with no associated UP- 
DATES. It does this by walking forwards through the 
trace maintaining a set of pending UPDATES. When it 
reaches an UPDATE, it adds the affected widget to the 
pending set. When it reaches a COMMIT, if there are any 
widget(s) associated with this COMMIT in the pending 
set, it removes them from the pending set, otherwise it 
removes the COMMIT from the trace. 
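
The forward walk of Pass 2 can be sketched in the same style. Again this is a hedged illustration rather than KarDo's code: the trace format (`(kind, window, widget, value)` tuples), the function name `filter_commits`, and the rule that a COMMIT is associated with the pending widgets of its own window are assumptions.

```python
def filter_commits(trace):
    """Pass 2: drop COMMITs that have no pending UPDATE to write back."""
    pending = set()  # (window, widget) pairs updated but not yet committed
    kept = []
    for action in trace:
        kind, window, widget, _value = action
        if kind == "UPDATE":
            pending.add((window, widget))
        elif kind == "COMMIT":
            mine = {key for key in pending if key[0] == window}
            if not mine:
                continue      # nothing to commit: remove this COMMIT
            pending -= mine   # these widgets are now committed
        kept.append(action)
    return kept


# A trace as it might look after Pass 1: the first COMMIT lost its UPDATE.
after_pass1 = [
    ("COMMIT", "Dialog", "OK", None),
    ("UPDATE", "Dialog", "Check Box", False),
    ("COMMIT", "Dialog", "OK", None),
]
after_pass2 = filter_commits(after_pass1)
```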

One may fear that there are cases in which having the 
system go through an intermediate state is necessary even 
if that state is eventually overwritten. For example, if 
the task involves disabling a webserver, updating some 
configuration that can only be modified when the web- 
server is disabled and then re-enabling the webserver, 
it would be incorrect to remove the disabling and re- 
enabling of the webserver. While in theory such prob- 
lems could arise, we find that in practice they do not 
arise. This is because actions like enabling and disabling 
a webserver typically look to KarDo like independent 
UPDATES which do not reverse each other, since one 
may require clicking the “disable” button while the other 
requires clicking the “enable” button. This causes the 
mistake removal algorithm to be somewhat conservative, 
which is the appropriate bias since it’s worse to remove 
a required action than to leave a couple of unnecessary 
actions. 


5.2 Parametrization 


The second step of generalization is to parameterize the 
traces. Specifically, now that we have removed mis- 
takes and navigation actions, the remaining differences 
between traces of the same task are either user specific 
actions (e.g. user name), or machine configuration dif- 
ferences (static IP vs. dynamic IP) which change the set 
of necessary UPDATE or COMMIT actions. To integrate 
these differences into a canonical trace that works on all 
configurations KarDo parametrizes the traces as follows: 


(a) Parametrize UPDATES. The values associated with 
some UPDATE actions, such as usernames and pass- 
words, are inherently user specific and cannot be auto- 
mated. KarDo identifies these cases by recognizing when 
two different traces of the same task update the same widget
with different values. To handle these kinds of UPDATES,
KarDo parses all traces of a task to find all unique 
values that were given to each widget via UPDATE ac- 
tions that were subsequently committed. Based on these 
values the associated UPDATE actions are marked as ei- 
ther AutoEnter if the associated widget is assigned the 
same value in all traces of that task, or UserEnter if the 
associated widget is assigned a different value in each 
trace. On playback, AutoEnter UPDATES are performed 
automatically, while KarDo will stop playback and ask 
the user for UserEnter actions. Note that if the widget is 
assigned a few different values, many of which occur 
in multiple traces (e.g., a printer name), KarDo will assign
it PossibleAutoEnter, and on playback let the user 
select among values previously entered by multiple different
users or enter a new value. 
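
The three labels can be derived from the committed values that each trace assigned to a widget. The sketch below is one possible reading of the rule, not KarDo's exact thresholds: the function name `classify_widget` is hypothetical, the input is assumed to hold one committed value per trace, and a widget seen in a single trace is treated as AutoEnter.

```python
from collections import Counter


def classify_widget(values):
    """Label a widget from the committed values seen across traces.

    Returns (label, suggestions), where suggestions are values that
    could be offered to the user on playback.
    """
    counts = Counter(values)
    if len(counts) == 1:
        return "AutoEnter", list(counts)   # same value in every trace
    if len(counts) == len(values):
        return "UserEnter", []             # a different value in each trace
    # A few values, some repeated: offer the repeated ones, most common first.
    repeated = [value for value, n in counts.most_common() if n > 1]
    return "PossibleAutoEnter", repeated


port = classify_widget(["3389", "3389", "3389"])
username = classify_widget(["alice", "bob", "carol"])
printer = classify_widget(["HP-2", "HP-2", "HP-2", "Lab", "Lab", "Home"])
```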


(b) Parametrize Paths. All of the remaining differences
between traces now stem from configuration differences
in the underlying machine, which necessitate a 
different set of UPDATES or COMMITS in order to per- 
form the same task. To handle this type of difference, 
KarDo recognizes that when a user’s actions in two dif- 
ferent traces differ because of the underlying machine 
configuration, the same action will generate two different 
resulting views. For example, consider the task of setting 
up remote desktop. Different traces may have used dif- 
ferent routers, which require different sets of actions to 
configure the router. Since the routers are configured via 
a web browser, opening a web browser and navigating to 
the default IP address for router setup, http://192.168.1.1, 
will take the user to a different view depending on which 
router the user has. KarDo takes advantage of this to rec- 
ognize that if the DLink screen appears, then it must fol- 
low the actions from the trace for the DLink router, and 
similarly for the other router brands. 

Thus, KarDo builds a per-task state-modifying graph 
and automatically generates a separate execution branch 
with the branch point parameterized by how the GUI 
reacts, e.g., which router configuration screen appears. 
This ensures that even when differences in the underly- 
ing system create the need for different sets of UPDATES 
and COMMITS, KarDo can still automatically execute the 
solution without needing help from the user. If the traces 
actually perform different actions even though the under- 
lying system reacts exactly the same way, then these are 
typically mistakes, which would be removed by our fil- 
tering algorithm above. If differences still exist after fil- 
tering, this typically represents two ways of performing 
the same step in the task, e.g., downloading a file using IE 
vs. Firefox. Thus KarDo retains both possible paths in 
the canonical solution and if both are available on a given 
playback machine, then KarDo will choose the path that 
is the most common among the different traces. 


5.3 Building a Global Navigation Graph 


Real world machines expose high configuration diversity. 
This diversity ranges from basic system-level configuration,
like which programs a user puts on their desktop 
and which they put in their Start Menu, to per application 
configuration like whether a user enables a particular tool 
bar in Microsoft Word, or whether they configure their 
default view in Outlook to be e-mail view or calendar 
view. All of these configuration differences affect how 
one can reach a particular widget to perform a necessary 
UPDATE or COMMIT. KarDo handles this diversity with 
only a few traces for each task by leveraging that multiple 
tasks may touch the same widget, and building a single 
general navigation graph using traces for all tasks. 

Figure 5: Illustration of the Navigation Graph. A simplified illustration
showing a few ways to reach the IE window. The actual graph 
is per widget and includes many more edges. 


Building such a general navigation graph is relatively 
straightforward. KarDo marks each NAVIGATE action as 
an enabler to all of the widgets that it makes available. 
KarDo then adds a link to the navigation graph from the 
widget this NAVIGATE action interacted with (e.g., the 
icon or button that is clicked), to the widgets it made 
available, and associates this NAVIGATE action with that 
edge. Fig. 5 presents a simplified illustration of a por- 
tion of the navigation graph. It shows that one can run IE 
from the desktop, the Start menu, or the Run dialog box. 
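
Graph construction as described above can be sketched in a few lines. This is an illustrative Python sketch under stated assumptions: each recorded NAVIGATE action is given as a `(source_widget, action, enabled_widgets)` tuple, widget names are hypothetical hierarchical labels, and edges are stored per target widget (as incoming edges) because lookups start from the widget that must be reached.

```python
from collections import defaultdict


def build_navigation_graph(navigate_actions):
    """Map each widget to the (source widget, action) pairs that enable it."""
    incoming = defaultdict(list)
    for source, action, enabled in navigate_actions:
        for target in enabled:
            edge = (source, action)
            if edge not in incoming[target]:  # traces of many tasks overlap
                incoming[target].append(edge)
    return dict(incoming)


# NAVIGATE actions pooled from traces of different tasks (hypothetical names).
recorded = [
    ("Desktop/IE icon", "double-click", ["IE/Address bar"]),
    ("Start button", "click", ["Start/Internet Explorer", "Start/Run"]),
    ("Start/Internet Explorer", "click", ["IE/Address bar"]),
    ("Desktop/IE icon", "double-click", ["IE/Address bar"]),  # seen again
]
graph = build_navigation_graph(recorded)
```

Because the graph is pooled over all tasks, a way of reaching a widget recorded for one task becomes available to every other task that needs the same widget.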


6 Replay 


The replay process takes a solution constructed using the 
process described in the preceding sections, and produces 
the low-level window events to perform a task on a par- 
ticular machine. At each step, this process utilizes the 
full navigation graph, the per-task state-modifying de- 
pendency graph, and the current GUI context. 

During replay, KarDo walks down the task’s 
state-modifying dependency graph. As described in §5.2, this 
graph is parameterized by GUI context. Thus, KarDo 
utilizes the current GUI context and the installed applica- 
tions to determine the path to follow at any branch point. 

At each step, KarDo needs to ensure that the next state- 
modifying action is enabled. To enable a given UPDATE/ 
COMMIT action, KarDo finds the shortest directed path 
in the navigation graph between the widget required for 
the UPDATE/COMMIT action, and any widget that is cur- 
rently available on the screen. KarDo finds this path by 
working backwards in the navigation graph. Specifically, 
it first checks to see if the necessary widget is already 
available. If not, it looks in the navigation graph for all 
incoming edges to the necessary widget, and checks to 
see if any of the widgets associated with those edges are 
available. If not, it checks the incoming edges to those 
widgets, etc. It continues this process until either it finds 
a widget which is already available on the screen, or there 
are no more incoming edges to parse. 

Once KarDo’s navigation algorithm finds a relevant 
widget in the navigation graph which is currently available
on the screen, it performs the associated action. If 
the expected next widget in the graph appears, KarDo 
follows the path through the navigation graph until the 
widget associated with the necessary UPDATE/COMMIT 
action becomes available. If at any point, the expected 
widget that the edge leads to does not appear, however, 
KarDo marks that navigation edge as unusable, and again 
performs the above search process.³ It continues this 
process until either it succeeds in making the necessary 
UPDATE/COMMIT widget appear on the screen, or it has 
exhausted all possibilities and has no paths left in the navigation
graph between widgets currently on the screen 
and the next necessary UPDATE/COMMIT widget. 
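
The backward search and the handling of unusable edges can be sketched as a breadth-first search over incoming edges. This is a minimal Python sketch, not KarDo's implementation: the function name `find_path`, the graph format (target widget mapped to a list of `(source widget, action)` pairs), and the representation of unusable edges as `(source, target)` pairs are assumptions.

```python
from collections import deque


def find_path(incoming, target, available, unusable=frozenset()):
    """Shortest path from any on-screen widget to the target widget.

    Returns a list of (widget, action, next_widget) steps, [] if the
    target is already available, or None if no usable path exists.
    """
    if target in available:
        return []
    next_hop = {}       # widget -> (action, widget it leads to)
    visited = {target}
    queue = deque([target])
    while queue:
        widget = queue.popleft()
        for source, action in incoming.get(widget, []):
            if source in visited or (source, widget) in unusable:
                continue
            visited.add(source)
            next_hop[source] = (action, widget)
            if source in available:
                # Rebuild the forward path from this on-screen widget.
                path, current = [], source
                while current != target:
                    step_action, following = next_hop[current]
                    path.append((current, step_action, following))
                    current = following
                return path
            queue.append(source)
    return None


incoming = {
    "IE/Address bar": [("Desktop/IE icon", "double-click"),
                       ("Start/Internet Explorer", "click")],
    "Start/Internet Explorer": [("Start button", "click")],
}
on_screen = {"Start button", "Desktop/IE icon"}

first_try = find_path(incoming, "IE/Address bar", on_screen)
# Suppose the desktop icon did not lead to IE: mark that edge and retry.
retry = find_path(incoming, "IE/Address bar", on_screen,
                  unusable={("Desktop/IE icon", "IE/Address bar")})
```

Because the search expands outward from the target one edge at a time, the first on-screen widget it encounters yields a path with the fewest navigation steps.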

Finally, each abstract action, whether NAVIGATE or 
state-modifying, is mapped to a low-level windowing 
event by utilizing the accessibility interface similar to the 
way it is used during recording. 


7 Solution Validation 


When a user uploads a solution for a task, KarDo allows 
the user to provide a solution-check. To do so, the user 
performs the steps necessary to confirm the task has been 
completed correctly and highlights the GUI widget that 
indicates success. For example, to check an IPv6 config- 
uration, the user can go to ipv6.google.com and highlight 
the Google search button. As with standard tasks, KarDo 
will map the trace to abstract actions, clean it of irrelevant
actions, etc. Such solution-checks allow KarDo 
to confirm that its canonical solution for a task works on 
all configurations by playing the solution followed by its 
solution-check on a set of VMs with diverse configura- 
tions, and checking that in each VM the highlighted GUI 
widget has the same state as in the solution-check. 


8 Security 


Ensuring that users cannot insert malicious actions into 
KarDo’s solutions is an important topic that represents a 
research paper on its own. We do not attempt to tackle 
that problem in this paper. To handle non-malicious mistakes,
however, KarDo takes a Microsoft Volume Shadow Copy 
Service snapshot before automatically performing a task 
and rolls back if the user is unhappy with the results. 


9 Implementation 


The KarDo implementation has three components: a 
client for doing the recording and the playback, a server 
to act as the solution repository, and a virtual machine 
infrastructure for remote recording and solution testing. 


9.1 Client 


Our current KarDo client is built on Microsoft Windows
as a browser plugin. The user interface runs in the
browser and is built using standard HTML and JavaScript,
which communicate with the plugin to provide all KarDo
functionality. The plugin is written natively in C++. As
discussed in §4.1, the plugin uses the OS Accessibility
API to do the recording and playback.


³ KarDo caches the searched subgraphs to speed up any later searches.

The main implementation challenge in the client is to 
ensure that the GUI context of each mouse click and
keypress can be recorded before the GUI changes as a
result of the user action, e.g., before the window closes
as a result of clicking the “OK” button. KarDo achieves
the timely recording of the GUI context by utilizing the 
Windows Hooks API, which allows registration of a call- 
back function to be called immediately before keypress 
and/or mouse click messages are passed to the applica- 
tion. The challenge is that such a callback function needs 
to be extremely fast, otherwise the UI feels sluggish to 
the user [17]. Calls to MSAA to get the GUI context are 
very slow, however, for two reasons: (1) they use the
Microsoft Component Object Model (COM)⁴ interface to
marshal and unmarshal arguments for each function call, 
and (2) MSAA requires a separate call for each attribute 
of each widget on the screen (e.g., a widget name or role) 
often resulting in thousands of COM function calls per 
window. 

We use two main techniques to maintain acceptable 
recording performance. First, we implement the callback 
function in a shared library so that it can run in-process 
with the application receiving the click/keypress. This 
significantly improves performance since it avoids the 
overhead of COM IPC for each function call. Second, 
instead of recording the GUI context of every window 
on the screen with every user input, we record only the 
full context of the window receiving the user input, and 
for all other windows we record only high-level
information such as the window handle and window title. As we
show in §10.4, this significantly improves performance 
when the user has many other windows open. 


9.2 Solution Repository Server 


The solution server provides a central location for upload, 
download and storage of all solutions. In our current im- 
plementation, all solution merging also happens on the 
server. We implement the solution server on Linux us- 
ing a standard Apache/Tomcat server backed by a Post- 
gres database. All solutions are stored on disk, with all 
meta-data stored in the database. When the client fin- 
ishes recording a trace, KarDo immediately asks the user 
if he would like to upload the trace. Upon confirmation, 
the client uploads the trace to the server, and the server 
searches its existing database for solutions with similar 
sets of steps, and asks the user to confirm if his trace 
matches any of these. The server also provides a web
interface listing all solutions. When a user finds a task they
would like automatically performed, they click the Play
button, which calls into the client browser plugin to download
that solution from the server and start playback.


⁴ A binary interface used for inter-process communication.


9.3 Virtual Machine Infrastructure


The VM infrastructure is used for two purposes: 1) to en- 
able users to record a solution for a task which they either 
cannot or do not want to perform on their own machine; 
and 2) to perform solution validation as discussed in §7.⁵
KarDo’s VM infrastructure is built on top of Kernel-based
Virtual Machine (KVM) [6]. Its design is based on
Golden Master (GM) VM images, which are generic ma- 
chine images that have been configured to expose a cer- 
tain dimension of configuration diversity, or make avail- 
able a certain set of tasks. For example, some GMs are 
configured with static IP addresses, while others have dy- 
namic IP addresses, and some have Outlook as the default 
mail client, while others have Thunderbird. The infras- 
tructure can then quickly bring up a running snapshot of 
any GM by taking advantage of KVM’s copy-on-write 
disks and its memory snapshotting support. 


10 Evaluation 


We evaluate KarDo on 57 computer tasks which together 
include more than 1000 actions and are drawn from the 
Microsoft Help website [9] and the eHow [4] website. 
We chose these tasks by randomly pulling articles from 
the websites and then eliminating those which did not 
describe an actual task (e.g., “What does Microsoft
Exchange do?”), those which described hardware changes
(e.g., “How to add more RAM”), and those which required
software to which we did not already have a license. We 
focused on common programs, e.g., Outlook, IE, and di- 
versified the tasks to address Web, Email, Networking, 
etc. The full list of tasks is shown in Table 3 and includes 
tasks like configuring IPv6, defragmenting a hard drive, 
and setting up remote desktop. 


10.1 Handling Configuration Diversity 


Our goal with KarDo is to handle the wide diversity of 
ways in which users configure their machines. Measur- 
ing KarDo’s performance on a small number of actual 
user machines is not representative of the wide diversity 
of configurations, however, since many users leave the 
default option for most configuration settings. To capture 
this wide diversity, we generate a pool of 20 virtual ma- 
chines whose configurations differ along the following 
axes: differences of installed applications (e.g., Firefox
vs. IE, Thunderbird vs. Outlook), differences of
per-application configuration (e.g., different enabled tool and
menu bars), user-specific OS configuration (e.g., different
views of the control panel, different icons on the desktop),
and different desktop states (e.g., different windows
or applications already opened). We apply each configuration
option to a random subset of the VMs. This results
in a set of machines with more configuration diversity
than normal, but which represent the kind of diversity of
configurations we would like to handle.


Figure 6: Success Rate on Diverse Configurations: For each
task on the x-axis the figure plots on the y-axis the percentage of test
VMs that succeeded in performing the task using a specific automation
scheme. For each scheme, the area under the curve refers to the success
rate taken over all task-VM pairs. KarDo-Two-Traces has a success rate
of 84%, whereas KarDo-One-Trace has a success rate of 64%. In contrast,
Best-Trace, which tries both of the two traces and picks whichever
works better, has a success rate of only 18%, and Random-Trace, which
randomly chooses between the two traces, has a success rate of only
11%.


⁵ We also used it to produce the evaluation results in §10.1.

We separate this pool of VMs into 10 training and 10 
test. We recruited a set of 6 different users to help us 
record traces, including 2 non-expert users and 4 com- 
puter science experts. For each of the 57 evaluation tasks, 
two of the six users perform the task on two randomly 
chosen VMs from the training set. We then try to replay 
each task on the 10 test VMs. We compare four schemes: 


• KarDo - Two Traces: We generate a canonical
solution by merging together the two traces for each
task, and we generate a navigation graph using all
of the traces from all tasks. We then use the KarDo
replay algorithm to play back the resulting solutions
on the test VMs.

• KarDo - One Trace: We randomly pick one of the
two traces and use it to generate a canonical solution
for that task. The navigation graph is generated from
that trace plus all traces for all other tasks (but not
the other trace for that same task).

• Baseline - Best Trace: For each VM, we try
directly playing both of the two recorded traces for
each task. If either trace succeeds then we report
success for that VM-task combination. This shows
how well a baseline system would perform with two
traces per task.

• Baseline - Random Trace: We randomly pick one
of the two traces and directly play back all of the
GUI actions in the original trace on the test VMs.
This represents how well a baseline system would
perform with only one trace per task.


Fig. 6 plots the success rate of these four schemes. It 
shows that the Best-Trace approach succeeds on average 
on only 18% of the VMs while the Random-Trace suc- 
ceeds on just 11% of the test VMs. In contrast, KarDo 
succeeds on 84% of the 500+ VM-task pairs when given 
two traces, and on 64% when given only one trace. Thus, 
KarDo enables non-programmers to automate computer 
tasks across diverse configurations. 
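
To make the reported metric concrete, the following sketch computes a per-task success percentage (the quantity plotted for each task) and the overall rate over all task-VM pairs. The outcome data here is invented purely for illustration and is not taken from the evaluation; `success_rates` is a hypothetical helper.

```python
def success_rates(results):
    """results maps a task name to a list of booleans, one per test VM.

    Returns (per_task, overall): the fraction of VMs that succeeded for
    each task, and the fraction of successful task-VM pairs overall.
    """
    per_task = {task: sum(outcomes) / len(outcomes)
                for task, outcomes in results.items()}
    pairs = [ok for outcomes in results.values() for ok in outcomes]
    overall = sum(pairs) / len(pairs)
    return per_task, overall


# Hypothetical outcomes for three tasks on ten test VMs each.
example = {
    "configure-ipv6": [True] * 10,
    "defragment-disk": [True] * 8 + [False] * 2,
    "remote-desktop": [True] * 7 + [False] * 3,
}
per_task, overall = success_rates(example)
```

Because every task is replayed on the same number of test VMs, the overall rate equals the mean of the per-task rates, which is why the area under each curve in Fig. 6 can be read as the success rate over all task-VM pairs.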


10.2 Understanding Baseline Errors 


The Best-Trace and Random-Trace schemes are very 
susceptible to configuration differences. Even a sin- 
gle configuration difference can cause the Random-Trace 
scheme to fail. The two traces considered by the Best- 
Trace approach make it more robust to configuration dif- 
ferences, but it still only works if the test VM looks very 
similar to one of the VMs on which the recordings were 
performed. Consider a case where one recording opened 
Outlook from the desktop, and then accessed a menu item 
to change some configuration, and the other recording 
opened it from the Start Menu, and then used the tool bar 
to change that configuration. Even in this simple case 
where the two recordings see a large amount of diversity 
between them, the Best-Trace algorithm cannot handle a 
case where the tool bars are turned off, but Outlook is not 
on the desktop, or a case where menus are turned off, but 
Outlook is not in the Start Menu. More generally, even 
if the test VM is a hybrid of the two VMs on which the 
traces were recorded, the Best-Trace approach will fail. 
This is because a hybrid configuration requires pulling 
different parts from each of the traces which cannot be 
done without KarDo’s technique of merging the traces 
together. Thus, the Best-Trace approach requires an ex- 
cessive number of examples to successfully play back on 
diverse machines. Finally, we note that there are a num- 
ber of tasks where the Best-Trace fails on all VMs. This 
occurs when all test VMs are widely different from the 
two VMs where the recordings were performed. 
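
The Outlook example can be captured in a toy model. This is a deliberately simplified sketch, not KarDo's merging algorithm: it assumes each trace is a list of (stage, method) steps and that a configuration is just the set of methods available on a machine. All names are hypothetical.

```python
# Two recordings of the same task, taken on differently configured VMs.
TRACE_A = [("open", "desktop_icon"), ("configure", "menu")]
TRACE_B = [("open", "start_menu"), ("configure", "toolbar")]


def replay_direct(trace, available):
    """Direct replay works only if every recorded step is available."""
    return all(method in available for _, method in trace)


def best_trace(traces, available):
    """Baseline: succeed if at least one whole trace replays unchanged."""
    return any(replay_direct(trace, available) for trace in traces)


def replay_merged(traces, available):
    """Merged replay: each stage may use a method from any trace."""
    alternatives = {}
    for trace in traces:
        for stage, method in trace:
            alternatives.setdefault(stage, set()).add(method)
    return all(methods & available for methods in alternatives.values())


# Hybrid machine: tool bars turned off and no Outlook icon on the desktop.
hybrid = {"start_menu", "menu"}
baseline_ok = best_trace([TRACE_A, TRACE_B], hybrid)
merged_ok = replay_merged([TRACE_A, TRACE_B], hybrid)
```

On the hybrid machine neither recording replays as a whole, but the two recordings together contain a working method for every stage, which is the situation the merged solution is meant to exploit.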


10.3 Understanding KarDo Errors


While KarDo successfully plays back in the vast majority
of the cases, it still fails to play back successfully on
16% of the VM-task pairs. There are three main causes
of these errors: classifier mistakes, incorrect navigation 
steps, and missing navigation steps. Fig. 7 shows the 
breakdown of these errors. Specifically, it shows that
eliminating classification errors results in a 91% success
rate, while eliminating incorrect navigation steps results
in a 95% success rate. We observe that the remaining 5%
of the errors result mostly from missing navigation steps.
The following discusses each of these in detail.


Figure 7: Cause of KarDo Errors: This figure shows the breakdown
of the KarDo playback errors by showing the playback success when
various parts of the KarDo algorithms are replaced by oracle versions.
Recall that the success rate is the area under the curve. Based on the
figure, replacing KarDo’s classifiers with oracle classifiers increases the
playback success rate from 84% to 91%. Additionally, eliminating all
mistakes in the navigation database by using an oracle for navigation
increases the playback success rate from 91% to 95%. The remaining
5% failure cases result from missing navigation steps that did not appear
in any of the input traces.


(a) ML Classification Errors: To evaluate our ML clas- 
sifier, we manually labeled each of the actions performed 
by the users for the 57 tasks as a COMMIT action, an UP- 
DATE action, both or neither. We then split this labeled 
data into half training and half test data. As described 
in §4.3 we run two separate classifiers on the data, one 
for UPDATE actions, and one for COMMIT actions. Since 
KarDo’s generalization algorithm (from §5) retains only
COMMITs and UPDATEs as necessary actions, false negative
misclassifications will cause KarDo to skip one of
these necessary UPDATEs or COMMITs during playback.
False positives, on the other hand, will cause unnecessary
actions to be retained, requiring KarDo to attempt
to play back irrelevant actions which may be unavailable
on a test VM. We calculate the false positive rate for each
of the two classifiers as the percentage of actions in the
COMMIT/UPDATE class that should not be in it, and the
false negative rate as the percentage of actions not in the
COMMIT/UPDATE class that should be in it.
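
The two rates, as defined in the preceding paragraph, can be computed as in the sketch below. Note that these definitions are taken relative to the predicted class, so they correspond to what is often called the false discovery rate and the false omission rate. The function name and the sample labels are hypothetical.

```python
def classifier_error_rates(predicted, actual):
    """Compute the error rates using the definitions in the text.

    predicted and actual are parallel lists of booleans, where True
    means the action is (or should be) in the COMMIT/UPDATE class.
    """
    in_class = [a for p, a in zip(predicted, actual) if p]
    out_of_class = [a for p, a in zip(predicted, actual) if not p]
    # Share of actions placed in the class that should not be in it.
    fp_rate = sum(not a for a in in_class) / len(in_class) if in_class else 0.0
    # Share of actions left out of the class that should be in it.
    fn_rate = sum(out_of_class) / len(out_of_class) if out_of_class else 0.0
    return fp_rate, fn_rate


# Invented labels for ten actions, for illustration only.
predicted = [True, True, True, True, False, False, False, False, False, False]
actual = [True, True, True, False, True, False, False, False, False, False]
fp_rate, fn_rate = classifier_error_rates(predicted, actual)
```

In this made-up sample, one of the four actions placed in the class does not belong there, and one of the six actions left out does.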


The resulting performance of the KarDo classifiers is
shown in Table 2. As we can see, the ML classifiers perform
quite well even though classification mistakes account
for almost half of the playback failures. Specifically,
the COMMIT classifier has a false positive rate of
only 2% and a false negative rate of only 3%. The COMMIT
classifier performs so well because COMMITs follow
very predictable patterns, i.e., they almost always occur
when a button is pressed, and very frequently cause the
associated window to close. The UPDATE classifier performs
slightly worse with a 6% false positive rate and a
5% false negative rate. The higher false positive rate for
UPDATEs is caused by actions using widgets like combo
boxes and edit boxes which are typically used for UPDATEs,
but are sometimes used just for navigational purposes.
Occasionally when an action uses one of these
widgets only for navigation (i.e., it’s not an UPDATE),
KarDo will misclassify the action as an UPDATE action.
The higher false negative rate stems from actions which
are both UPDATEs and COMMITs. These actions tend to
look much more like COMMITs than UPDATEs and as a
result the COMMIT classifier typically correctly classifies
them, but the UPDATE classifier occasionally misclassifies
them, not realizing they are also UPDATEs. One
such example is clicking the button to defragment your
hard drive, which looks very much like a COMMIT action
as it is a button click, and closes the associated window,
but does not look very much like a typical UPDATE
action since button clicks usually do not update any system
state. In fact, if we test the UPDATE classifier after
removing actions that are both COMMITs and UPDATEs
from the training and test sets the false negative rate drops
to 2% without increasing the false positive rate at all.


           False Positive Rate   False Negative Rate
COMMITs            2%                    3%
UPDATEs            6%                    5%

Table 2: Performance of the COMMIT and UPDATE Classifiers.

Note that a misclassification does not necessarily cause 
an error in the resulting canonical trace. In particular, 
only misclassifications that result in the eventual discard 
of a necessary action produce erroneous task solutions. 
For example, one may misclassify an action that is both 
COMMIT and UPDATE as only COMMIT. Still, as long as 
the mistake removal algorithm keeps this action as neces- 
sary, the resulting solution will still perform the UPDATE. 

To evaluate the effect of classification mistakes on the 
final playback performance, we ran an “Oracle Classifier”
version of KarDo where instead of using the output
from the ML classifier to determine whether an action
is an UPDATE or a COMMIT, we directly use the
hand-generated labels so that all classifications are correct. As
shown in Fig. 7, this increases the playback success rate
by 7 percentage points, from 84% to 91%. More
training data would help eliminate these mistakes.


(b) Incorrect Navigation Steps: The next cause of play- 
back problems comes from limitations in the way we 
currently generate the navigation graph. As discussed 
in §4.3, KarDo assumes that navigation depends only on
the final action that made a widget visible. In a few cases,
however, navigation depends on other earlier actions in
the trace. A simple example of this is the “Run” dialog
box which allows a user to type in the name of a program
and then click “OK” to run it. In this case, the navigation
depends not only on clicking “OK”, but also on the
program name filled into the edit box.


Figure 8: Real Time Window System Context Recording: The
figure shows that KarDo’s optimized recording, which limits recording
full context information to only the main window, has a response time
less than 100ms regardless of the number of windows. This is significantly
below the 200ms threshold at which users perceive the UI to be
sluggish. In contrast, recording the full context of all windows has a
response time that scales with the number of windows, eventually
becoming very slow.

To test the effect of incorrect navigation steps on the fi- 
nal playback success, we hand labeled all such dependent 
navigation actions. We then ran an “Oracle Navigation”
version of KarDo where each navigation step had the full
set of required actions associated with it. As shown in
Fig. 7, this increases the playback success by an additional
4 percentage points, from 91% to 95%. These mistakes can
be eliminated by the additional classifier discussed in §12.


(c) Missing Navigation Steps: The final cause of play- 
back problems stems from KarDo’s fairly limited view 
of the GUI navigation landscape, due to the relatively 
small number of input traces in our experiments. Specif- 
ically, since many of the traces KarDo uses to generate 
its solutions are performed by users that already know 
how to perform a task, these traces rarely include navi- 
gation information related to incorrect navigations. This 
can cause playback to fail in the small fraction of cases 
where KarDo navigates in a way that is not appropriate 
for a given configuration and thus results in an error di- 
alog box or some other GUI widget/window which was 
not seen in any trace. In this case, to ensure that it does 
not cause any problems, KarDo will immediately abort 
playback and roll back the user’s machine to its original 
state. These types of errors account for most of the
remaining 5% of playback errors shown in Fig. 7, and can
be solved by recording more traces.


10.4 Feasibility Micro-Benchmarks 


We want to ensure that KarDo’s design performs well 
enough to be feasible in practice. To test this, we ran 
three performance tests on a standard 2.4 GHz Intel 
Core2 Duo desktop machine. 

First, as discussed in §9.1, context recording has to be
fast so that it does not cause the user to perceive the UI as
sluggish. Fig. 8 shows that even with many windows on
the screen, KarDo can grab the relevant windowing system
context in well less than 100ms, and the overhead is
relatively constant regardless of the number of windows.
Since users only start to notice delay when it is greater
than 200ms [17], this additional delay should be acceptable
to users. In contrast, a scheme which records the
context of all windows reaches an unacceptable delay of
more than 1 second with even just 15 windows open.


Figure 9: Performance of Solution Merging: The figure graphs the
time that KarDo takes to merge a given number of traces, showing that
KarDo can scale to quickly merge a large number of traces for a given
task.

Next, we check the performance of solution merging. 
Fig. 9 shows that merging up to 50 traces takes only 15 
seconds, and it takes less than a second to merge 5 traces. 
This result shows that KarDo can easily scale to merging 
a large number of traces for each task. 

Finally, KarDo’s playback is relatively fast. For the 57 
tasks in Table 3, playing a KarDo solution takes on av- 
erage 52 seconds with a standard deviation of 9 seconds. 
The maximum replay time was 125 seconds, which was 
mostly spent waiting for the virus scanner to finish. 


10.5 Working with Users 


We evaluate KarDo’s ability to improve on the status quo 
of using text instructions to perform computer tasks. We 
asked 12 CS students to perform 5 computer tasks within 
1 hour, based on instructions from our lab website. We 
also used KarDo to automate each task by merging the 
students’ traces into a single canonical solution. 

We find three important results. First, as shown in 
Fig. 10(a), even with detailed instructions, the students 
fail to correctly complete the tasks in 20% of the cases. 
In contrast, KarDo always succeeded in generating a so- 
lution that automated the task on all 12 user machines. 

Second, as shown in Fig. 10(b), even when the stu- 
dents did complete the tasks they performed on average 
84% more GUI actions than necessary, and sometimes 
more than three times the necessary number of actions. 
KarDo’s automation removes most of these irrelevant ac- 
tions, performing only 11% more actions than necessary. 

Third, as shown in Fig. 10(c), KarDo reduced the per- 
task required number of times the user had to interact 
with the machine from 25 to 2 times, on average. This 
reduction is because KarDo requires manual entry only 
for user-specific inputs, and automates everything else. 


          IMAP           Double-Sided  Active     Virus  Fix FireFox
          Configuration  Printing      Directory  Scan   Certificate
KarDo     Yes            Yes           Yes        Yes    Yes
User 1    Yes            Yes           Yes        Yes    Yes
User 2    Yes            No            Yes        Yes    Yes
User 3    Yes            Yes           Yes        Yes    Yes
User 4    Yes            No            No         Yes    Yes
User 5    No             No            Yes        Yes    Yes
User 6    No             Yes           No         Yes    Yes
User 7    Yes            Yes           Yes        Yes    Yes
User 8    Yes            No            Yes        Yes    Yes
User 9    Yes            No            Yes        Yes    Yes
User 10   No             No            Yes        Yes    Yes
User 11   Yes            Yes           Yes        Yes    Yes
User 12   Yes            Yes           Yes        Yes    Yes

(a) Task successes and failures.


[Bar chart: percentage of irrelevant actions per task (IMAP, DS Print, AD, Vscan, Cert), for User Average and KarDo.]

(b) Percentage of irrelevant actions performed by users and KarDo.





[Bar chart: user manual inputs per task (IMAP, DS Print, AD, Vscan, Cert), for Manual and KarDo.]

(c) User manual inputs with and without KarDo.


Figure 10: Working with Users: The figures show that (a) KarDo
performs the task correctly, even when many users fail, (b) KarDo fil- 
ters most irrelevant actions, and (c) with KarDo users need to manually 
perform very few steps, typically only those which require user-specific 
information. 

These results show that KarDo can help users reduce the
time and effort spent on IT tasks.


11 Related Work 


While there are many tools to help automate computer 
tasks, most either do not support recording and must 
be scripted by programmers (e.g., AutoIt [2] and
AutoHotKey [1]), or allow recording only by relying on
application-specific APIs and thus cannot be used to automate
generic computer tasks (e.g., macros, DocWizards [14]).
Apple’s Automator [3], Sikuli [13] and AutoBash [18] 
are the only exceptions as far as we know. However, nei- 
ther Automator nor Sikuli can automatically produce a 
canonical GUI solution that works on different machine 
configurations. AutoBash covers only tasks which are 
entirely contained on the local machine, which is increasingly
infrequent with today’s networked computer systems.
Additionally, it requires modifying the kernel to
track dependencies across applications and then taking 
diffs of the affected files. Such kernel modifications are a 
deployment barrier, and file diffs are ineffective on binary 
file formats. 

Some tools support recording and checkpointing, such
as DejaView [16], but they do not actually play back a
task, instead only returning to a checkpointed state.

Lastly, there are tools that leverage shared information 
across a large user population [21, 20, 15, 19, 12, 11]. 
Strider [21] and PeerPressure [20] diagnose configura- 
tion problems by comparing entries in Windows registry 
on the affected machine against their values on a healthy 
machine or their default values in the population. FTN 
addresses the privacy problem in sharing configuration 
state by resorting to social networks [15]. [19] and [12] 
track kernel calls similar to AutoBash to determine prob- 
lem signatures and their solutions. NetPrints [11] collects 
examples of good and bad network configurations, builds 
a decision tree, and determines the set of configuration 
changes needed to change a configuration from bad to 
good. All of these tools compare potentially problematic 
state information against a healthy state to address
computer problems and failures. KarDo focuses on a
complementary issue where the existing machine state may be
perfectly functional but the user wants to perform a new
task. KarDo addresses such how-to tasks by working at 
the GUI level, which allows it to handle any general task 
the user can perform. 


12 Addressing KarDo’s Limitations 


While our system represents a first step towards provid- 
ing a system for automating a task by doing it, our cur- 
rent implementation has multiple limitations we expect 
to explore in future work. First, our model of labeling 
all actions as COMMIT, UPDATE, or NAVIGATE actions
is not exhaustive. Specifically, it does not cover
tasks which simply show something on the screen. For 
example, a task like “Find my IP Address” will look to 
KarDo like it does nothing, and so all actions will be re- 
moved. This can be addressed by extending the model. 
Second, as discussed in §10.3, KarDo does not handle tasks
containing complex navigation actions. For example, if
navigation requires typing the name of a program in an 
edit box and then clicking “Run” then KarDo will only 
click the “Run” button. This can be solved using an addi- 
tional classifier to detect these dependent navigation ac- 
tions. Finally, KarDo requires unnecessary manual steps 
when entering the same user specific information across 
many tasks. For example, a user will have to manually 
enter his Google username every time he wants to run any 
task that accesses Google services. To handle this, we’d
like to build a profile for each user which will remember
previous inputs by a user and reuse them across tasks.


13 Concluding Remarks


This paper presents a system for enabling automation of 
computer tasks, by recording traces of low-level user ac- 
tions, and then generalizing these traces for playback on 
other machine configurations through the use of machine 
learning and static analysis. We show that automated 
tasks produced by our system work on 84% of config- 
urations, while baseline automation techniques work on 
only 18% of configurations. 

This paper has focused on use of our system for build- 
ing an on-line repository of automated IT tasks which 
would include both local configuration and setup as well 
as remote tasks such as configuring a wireless router. We 
note, however, that our system is useful for many other 
applications as well, including replacing IT knowledge- 
bases, automated software testing, and even use by expert 
users as an easy way to automate repetitive tasks. 
Abstract


Software misconfigurations are time-consuming and 
enormously frustrating to troubleshoot. In this paper, we 
show that dynamic information flow analysis helps solve 
these problems by pinpointing the root cause of config- 
uration errors. We have built a tool called ConfAid that 
instruments application binaries to monitor the causal 
dependencies introduced through control and data flow 
as the program executes — ConfAid uses these depen- 
dencies to link the erroneous behavior to specific to- 
kens in configuration files. Our results using ConfAid to 
solve misconfigurations in OpenSSH, Apache, and Post- 
fix show that ConfAid identifies the source of the miscon- 
figuration as the first or second most likely root cause for 
18 out of 18 real-world configuration errors and for 55 
out of 60 randomly generated errors. ConfAid runs in 
only a few minutes, making it an attractive alternative to 
manual debugging. 


1 Introduction 


Complex software systems are difficult to configure and 
manage. When problems inevitably arise, operators 
spend considerable time troubleshooting those problems 
by identifying root causes and correcting them. The cost 
of troubleshooting is substantial. Technical support con- 
tributes 17% of the total cost of ownership of today’s 
desktop computers [24], and troubleshooting misconfig- 
urations is a large part of technical support. For informa- 
tion systems, administrative expenses, made up almost 
entirely of people costs, represent 60-80% of the total 
cost of ownership [16]. Even for casual computer users, 
troubleshooting is often enormously frustrating. 

In this paper, we show that system support for dynamic 
information flow analysis can substantially simplify and 
reduce the human effort needed to troubleshoot software 
systems. We focus specifically on configuration errors, 
in which the application code is correct, but the software
has been installed, configured, or updated incorrectly so
that it does not behave as desired. For instance, a mistake 
in a configuration file may lead software to crash, assert, 
or simply produce erroneous output. 


Why address misconfigurations specifically? Empiri- 
cal evidence exists that misconfigurations are often the 
dominant cause of problems in deployed systems. For 
example, Gray [20] attributed 42% of system outages to 
administration, while software, hardware, and environ- 
ment failures account for 25%, 18%, and 14% of failures, 
respectively. Murphy and Gent [31] note that the per- 
centage of failures attributable to system management is 
increasing over time, and that management failures have 
come to dominate the combination of software and hard- 
ware failures. Other studies have shown that configura- 
tion errors are the largest category of operator mistakes. 
Oppenheimer et al. [35] studied three commercial Inter- 
net services and found that more than 50% of the opera- 
tor mistakes that led to service unavailability were mis- 
configurations. Nagaraja et al. [33] found that software 
misconfiguration was the most common type of operator 
mistake, accounting for more than half of all mistakes. 
Other studies have shown similar results [7, 8, 23]. Fur- 
ther, while fault tolerance techniques such as modular 
redundancy [30] or Byzantine fault tolerance [10] can 
mask software and hardware faults, they do not prevent 
human error such as an operator who misconfigures all 
replicas [20, 23]. 

Consider how users and administrators typically de- 
bug configuration problems. Misconfigurations are often 
exhibited by an application unexpectedly terminating or 
producing erroneous output. While an ideal application 
would always output a helpful error message when such 
events occur, it is unfortunately the case that such mes- 
sages are often cryptic, misleading, or even non-existent. 
Thus, the person using the application must ask col- 
leagues and search manuals, FAQs, and online forums to 
find potential solutions to the problem. Troubleshooting 
is a tedious, time-consuming process that can substantially
increase the time to recover (TTR) from a failure.

To remedy this problem, we have developed a tool, 
called ConfAid, that uses dynamic information flow 
analysis to identify the likely root cause of a configu- 
ration problem. When a user or administrator wishes to 
troubleshoot a problem such as a crash or incorrect out- 
put, she reproduces the problem while ConfAid modi- 
fies the executed application binaries to track the causal 
dependencies between configuration inputs and program 
behavior. ConfAid produces an ordered list of the con- 
figuration tokens most likely to have caused the exhibited 
problem. While dynamic analysis takes a few minutes 
for a complex application such as Apache, automated 
troubleshooting is still considerably faster and less labor- 
intensive than manual debugging or searching through 
FAQs and online forums. 

ConfAid dynamically tracks causality (i.e., informa- 
tion flow) at a fine granularity, namely at the level of 
instructions and bytes. While there is a large body of 
work in the distributed systems community that tracks 
causality to understand and troubleshoot program behav- 
ior [2, 5, 6, 11, 12, 13], these prior systems essentially 
treat application binaries as black boxes, understanding 
causal relationships between processes by tracking net- 
work messages and IPCs. Some gain more information 
by inserting probes into applications to glean hints about 
their activity. ConfAid, however, “opens up the black- 
box” by examining the flow of causality within processes 
as they execute. Further, since ConfAid tracks causality 
using binary instrumentation [29], it does not require ap- 
plication source code to find misconfigurations. 

ConfAid restricts the scope of information flow anal- 
ysis to only track values that depend on data read from 
configuration files. ConfAid tracks dependencies intro- 
duced by both data and control flow. If it determines that 
altering a configuration parameter may change the ap- 
plication’s control flow such that it avoids the problem 
(and does not exhibit a different problem), it reports that 
parameter as a possible root cause. It propagates depen- 
dencies among multiple processes in a distributed system 
by annotating IPCs and network communication. 

Our results show that ConfAid identifies the correct 
root causes of most configuration errors. We injected 
18 real-world misconfigurations into OpenSSH, Apache, 
and the Postfix email server. ConfAid identifies the cor- 
rect root cause as the most likely source of the miscon- 
figuration in 13 cases; for the remaining 5 bugs, it lists 
the correct root cause as the second most likely option. 
ConfAid analysis takes less than 3 minutes, making the 
tool an attractive alternative to manual troubleshooting. 


2 Design principles 


We next briefly describe ConfAid’s design principles.

2.1 Use white-box analysis


ConfAid grew out of AutoBash [37], our
prior work in configuration troubleshooting. AutoBash
tracks causality at process and file granularity in order 
to diagnose configuration errors. It treats each process 
as a black box, such that all outputs of the process are 
considered to be dependent on all prior inputs. We found 
AutoBash to be very successful in identifying the root 
cause of problems, but the success was limited in that 
AutoBash would often identify a complex configuration 
file, such as Apache’s httpd.conf, as the source of an 
error. When such files contain hundreds of options, the 
root cause identification of the entire file is often too neb- 
ulous to be of great use. 

Our take-away lessons from AutoBash were: (1) 
causality tracking is an effective tool for identifying root 
causes, and (2) causality should be tracked at a finer 
granularity than an entire process to troubleshoot appli- 
cations with complex configuration files. These observa- 
tions led us to use a white box approach in ConfAid that 
tracks causality within each process at byte granularity. 

The granularity of the root causes reported to the user 
is also much finer. Instead of reporting the entire con- 
figuration file as a root cause, ConfAid points its users 
to specific tokens in the configuration file that it believes 
to be in error. This approach narrows down root causes 
considerably for programs like Apache. 


2.2 Operate on application binaries 


We next considered whether ConfAid should require ap- 
plication source code for operation. While using source 
code would make analysis easier, source code is unavail- 
able for many important applications, which would limit 
the applicability of our tool. Also, we felt it likely that 
we would have to choose a subset of programming lan- 
guages to support, which would also limit the number of 
applications we could analyze. 

For these reasons, we decided to design ConfAid to 
not require source code; ConfAid instead operates on 
program binaries. ConfAid uses Pin [29] to dynamically 
insert instrumentation into binaries as applications run. 
It also uses IDA Pro [22] to statically generate control 
flow graphs from binaries. 


2.3 Embrace imprecise analysis


Our final design decision was to embrace an imprecise 
analysis of causality that relies on heuristics rather than 
using a sound or complete analysis of information flow. 
Using an early prototype of ConfAid, we found that for 
any reasonably complex configuration problem, a strict 
definition of causal dependencies led to our tool outputting
almost all configuration values as the root cause
of the problem. Many registers and bytes in the address
space would come to depend on almost all configuration 
values. Our prototype would identify the root cause as 
only one of many possible causes. 

Thus, our current version of ConfAid uses several 
heuristics to limit the spread of causal dependencies. For 
instance, ConfAid does not consider all dependencies 
to be equal. It considers data flow dependencies to be 
more likely to lead to the root cause than control flow 
dependencies. It also considers control flow dependen- 
cies introduced closer to the error exhibition to be more 
likely to lead to the root cause than more distant ones. In 
some cases, ConfAid’s heuristics can lead to false nega- 
tives and false positives. However, our results show that 
in most cases, they are quite effective in narrowing the 
search for the root cause and reducing execution time. 


3 Design and implementation 


3.1 Overview: How ConfAid runs 


ConfAid is designed to be used by system administra- 
tors and end users when they encounter a suspected mis- 
configuration that they do not know how to fix. Conf- 
Aid is run offline, once erroneous behavior has been ob- 
served. A ConfAid user reproduces the problem by exe- 
cuting the application while ConfAid attaches to the ex- 
ecuting application processes and monitors information 
flow within them. For non-deterministic bugs, ConfAid 
could potentially leverage one of several deterministic re- 
play systems that can capture a buggy non-deterministic 
execution and faithfully reproduce it for later analy- 
sis [3, 18, 27, 36]. 

To use ConfAid, a user specifies: (1) which binaries 
ConfAid should monitor, (2) the sources of configura- 
tion data, and, as needed, (3) the erroneous external out- 
put of the application. For simple applications, Conf- 
Aid may monitor only a single process. For more com- 
plicated applications, ConfAid dynamically attaches to 
multiple specified processes and monitors inter-process 
dependencies as described in Section 3.5. While Conf- 
Aid could potentially monitor any process that receives 
input via IPC or a network message from a process al- 
ready monitored by ConfAid, we decided to only mon- 
itor executables specified by the user in order to limit 
the scope of analysis. Our prior experience with Auto- 
Bash showed that many extraneous processes communi- 
cate with processes being debugged via channels such 
as files, pipes, and signals, yet these processes are not 
needed to determine the root cause. 

Similarly, we could potentially treat any source of input
to a program as a source of configuration data. However,
such an approach would dramatically slow the analysis
since most locations in the process address space
would come to depend on one or more inputs. In contrast,
ConfAid only monitors input from designated configuration
sources. This makes ConfAid analysis more
tractable than generic taint tracking or program slicing
because the number of locations with dependencies
is small. Typically, the sources to monitor are self-evident;
e.g., httpd.conf is the configuration source
for Apache. Potentially, we could automate this process
by treating all inputs from specific locations (e.g., the
etc directory) or files with semantic keywords (such as
“*.conf”) as configuration inputs.

Finally, a ConfAid user may designate specific error 
conditions. ConfAid automatically treats assertion failures
and exits with non-zero return codes as erroneous
terminations. However, some misconfigurations lead not
to program termination, but instead to the process pro- 
ducing erroneous output. We therefore allow the user to 
specify a particular string expression as erroneous. Conf- 
Aid monitors the system calls that write to network, ter- 
minal, and other external output channels. When it finds 
a matching output, it considers the output an error. 

ConfAid outputs an ordered list of probable root 
causes. Each entry in the list is a token from a config- 
uration source; our results show that ConfAid typically 
outputs the actual root cause as the first or second entry 
in the list. This allows the ConfAid user to focus on one 
or two specific configuration tokens when deciding how 
to fix the problem. By finding the needle in the haystack, 
ConfAid dramatically improves TTR. 


3.2 Basic information flow analysis 


In this section, we describe the basic information flow 
analysis used by ConfAid. For simplicity of explana- 
tion, we defer discussing optimizations and heuristics 
until Sections 3.3 and 3.4. We also assume that ConfAid 
is tracking only a single process; Section 3.5 describes 
how we extend ConfAid analysis to multiple cooperat- 
ing processes on one or more computers. 

ConfAid dynamically monitors the information flow 
from configuration sources through process memory and 
registers to the point in the program execution when erro- 
neous behavior is observed. It does so by using Pin [29] 
to add custom logic, referred to as instrumentation, to the 
process binary. As described below, ConfAid instrumen- 
tation is executed before or after most x86 instructions 
executed by a monitored application. 

ConfAid uses taint tracking [34] to analyze informa- 
tion flow. It inserts instrumentation into the binary that 
monitors each system call such as read or pread that 
could potentially read data from a configuration source. 
If the source of the data returned by a system call was 
specified as a configuration file, ConfAid annotates the 
registers and memory addresses modified by the system

if (c == 0) {       /* c set to 0 in config file */
    x = a;          /* taken path */
} else {
    y = b;          /* alternate path */
}
z = d;
if (z) assert();    /* The erroneous behavior */


Figure 1: Example to illustrate causality tracking 


call with a marker that indicates a dependency on a spe- 
cific configuration token. Borrowing terminology from 
the taint tracking literature, we refer to this marking as 
the taint of the memory location. If an address or regis- 
ter is tainted by a token, ConfAid believes that the value 
at that location might be different if the value of the token 
in the original configuration source were to change. 


We use the notation $T_x$ to denote the taint set of variable
x. $T_x$ is a set of configuration tokens; for instance, if
$T_x = \{FOO, BAR\}$, ConfAid believes that the value of
variable x could change if the user were to modify either
the FOO or BAR tokens in the configuration file. In the basic
information flow analysis, taints are binary (a location
is either tainted by a token or it is not); in Section 3.4, we
attach a weight to each taint.


Taint is propagated via data flow and control flow dependencies.
When a monitored process executes an instruction
that modifies a memory address, register, or
CPU flag, the taint set of each modified location is set
to the union of the taint sets of the values read by the
instruction. For example, given the instruction x = y + z,
where the taint sets of y and z are $T_y$ and $T_z$ respectively,
the taint set of x, $T_x$, becomes $T_y \cup T_z$. Intuitively, the
value of x might change if a configuration token were to
cause y or z to change prior to the execution of this instruction.
For example, if $T_y = \{FOO, BAR\}$ and $T_z = \{FOO, BAZ\}$,
then $T_x = \{FOO, BAR, BAZ\}$.
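
The union rule for data flow can be sketched in a few lines of Python. This is an illustrative model only, not ConfAid's Pin instrumentation: the `taint` dictionary and the `propagate_data_flow` helper are hypothetical names, and locations are modeled as plain strings rather than registers and memory bytes.

```python
# Illustrative model: taint sets are frozensets of configuration token names.
# All names below are hypothetical and chosen for this sketch.

def propagate_data_flow(taint, dest, sources):
    """Set the taint of `dest` to the union of the taint sets of `sources`."""
    result = frozenset()
    for src in sources:
        result |= taint.get(src, frozenset())
    taint[dest] = result
    return result

taint = {
    "y": frozenset({"FOO", "BAR"}),
    "z": frozenset({"FOO", "BAZ"}),
}

# Models the instruction x = y + z from the text.
propagate_data_flow(taint, "x", ["y", "z"])
```

Note that the destination's previous taint is overwritten rather than merged, matching the statement that the taint set of each modified location is set to the union of the values read.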


In traditional taint tracking for security purposes, con- 
trol flow dependencies are often ignored to improve per- 
formance because they are harder for an attacker to ex- 
ploit. With ConfAid, however, we have found that track- 
ing control flow dependencies is essential since they 
propagate the majority of configuration-derived taint. 


A naive approach to tracking control flow is to union
the taint set of a branch conditional with a running control
flow dependency for the program. For example, on
executing the statement if (b), ConfAid could set the
control flow taint set, $T_{cf}$, to $T_{cf} \cup T_b$. However, without
mechanisms to remove taint from $T_{cf}$, control flow taint
grows without limit. This causes too many false positives,
i.e., ConfAid would identify most configuration
tokens as possible root causes.

A more precise approach takes into account the basic
block structure of a program. Consider the example
in Figure 1. Assume a, b, c, and d were read from a
configuration file and have taint sets $T_a$, $T_b$, $T_c$, and $T_d$,
respectively (i.e., $T_a$ is a set containing only configuration
token a). The value of c does not affect whether the
last two statements are executed, since they execute in all
possible paths (and therefore for all values of c). Thus,
$T_c$ should be removed from $T_{cf}$ before executing z = d.
When the program asserts, $T_{cf}$ should only include $T_d$ in
the example, to correctly indicate that changing the value
of d might fix the problem.
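
One way to picture the removal of control flow taint at merge points is a stack of open branches, sketched below in Python. The stack representation, the `ControlFlowTaint` class, and the merge point labels are assumptions made for illustration; the text only states that a branch's taint is removed once its merge point is reached.

```python
class ControlFlowTaint:
    """Hypothetical model of the running control flow taint set."""

    def __init__(self):
        # Each entry: (merge point label, taint set of the branch condition)
        self._stack = []

    def enter_branch(self, merge_point, cond_taint):
        self._stack.append((merge_point, frozenset(cond_taint)))

    def reach(self, point):
        # Drop every open branch whose merge point has now been reached.
        while self._stack and self._stack[-1][0] == point:
            self._stack.pop()

    def current(self):
        out = frozenset()
        for _, cond_taint in self._stack:
            out |= cond_taint
        return out

# Walk through the Figure 1 example.
cf = ControlFlowTaint()
cf.enter_branch("merge_1", {"c"})   # if (c == 0)
inside_branch = cf.current()        # control flow taint while x = a executes
cf.reach("merge_1")                 # branch paths converge before z = d
after_merge = cf.current()
cf.enter_branch("end", {"d"})       # if (z), where z carries the taint of d
at_assert = cf.current()
```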

ConfAid also tracks implicit control flow dependen- 
cies. In Figure 1, the values of x and y depend on c when 
the program asserts, since the occurrence of their assign- 
ments to a and b depend on whether or not the branch is 
taken. Note that y is still dependent on c even though the 
else path is not taken by the execution since the value of 
y might change if a configuration token is modified such 
that the condition evaluates differently. 

When the program executes a branch with a tainted 
condition, ConfAid first determines the merge point (the 
point where the branch paths converge) by consulting the 
control flow graph. Prior to dynamic analysis, ConfAid 
obtains the graph by using IDA Pro to statically analyze 
the executable and any libraries it uses (e.g., libc and
libssl). 

For each tainted branch, ConfAid next explores each 
alternate path that leads to the merge point. We define an 
alternate path to be any path not taken by the actual pro- 
gram execution that starts at a conditional branch instruc- 
tion for which the branch condition is tainted by one or 
more configuration values. ConfAid uses alternate path 
exploration to learn which variables would have been as- 
signed had the condition evaluated differently due to a 
modified configuration value. The taint set of any vari- 
able assigned on an alternate path is set to the union of 
its previous taint set, the taint set of the conditional, and 
the taint set of the variables read by the assigning instruction.
In the example, $T_y = T_y \cup T_c \cup \{T_c \wedge T_b\}$. In other
words, a configuration token affecting the previous value
of y could change, or c could change, causing the previous
value of y to be overwritten. Finally, it might be
necessary for both c and b to change (as denoted by the
term $\{T_c \wedge T_b\}$) since c allows the alternate assignment,
and b may need to reflect a correct configuration value.

To evaluate an alternate path, ConfAid executes the 
program by switching the condition outcome, similar 
to the predicate switching approach used by Zhang et 
al. [48] to explore implicit dependencies. ConfAid uses 
copy-on-write logging to checkpoint and roll back ap- 
plication state. When a memory address is first altered 
along an alternate path, ConfAid saves the previous value 
in an undo log. At the end of the alternate-path execution,
application state is replaced with the previous values from the
log. ConfAid uses Pin mechanisms to checkpoint and
roll back the state of the processor, which includes the
registers and CPU flags. Since some alternate paths are 
quite long, ConfAid uses a bounded horizon heuristic 
described in Section 3.3.1 to limit the number of in- 
structions it explores along each alternate path. Many 
branches need not be explored since their conditions are 
not tainted by any configuration token. 
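
The copy-on-write undo log can be illustrated with a small Python model. This is a sketch under simplifying assumptions: memory is a dictionary from addresses to values, and the `UndoLog` class is a hypothetical name; the actual tool operates on process memory through Pin.

```python
_ABSENT = object()  # marks an address that had no value before the first write

class UndoLog:
    """Hypothetical copy-on-write log for exploring an alternate path."""

    def __init__(self, memory):
        self.memory = memory
        self._saved = {}

    def write(self, addr, value):
        # Save the previous value only on the first write to this address.
        if addr not in self._saved:
            self._saved[addr] = self.memory.get(addr, _ABSENT)
        self.memory[addr] = value

    def rollback(self):
        # Restore every altered address to its value before exploration.
        for addr, old in self._saved.items():
            if old is _ABSENT:
                del self.memory[addr]
            else:
                self.memory[addr] = old
        self._saved.clear()

memory = {0x10: 1, 0x14: 2}
log = UndoLog(memory)
log.write(0x10, 99)    # first write: previous value 1 is logged
log.write(0x10, 100)   # later write: nothing new is logged
log.write(0x18, 7)     # address that did not exist before
during = dict(memory)
log.rollback()
```

Logging only the first write per address keeps the log proportional to the number of distinct locations touched, not the number of instructions executed on the alternate path.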

After exploring the alternate paths, ConfAid performs 
a similar analysis for the path actually taken by the pro- 
gram. This is the actual execution, so no undo log is 
needed. In the example, analyzing the taken path would 
derive $T_x = T_x \cup T_c \cup \{T_c \wedge T_a\}$.

ConfAid also uses alternate path exploration to learn 
which paths avoid erroneous application behavior. Conf- 
Aid considers an alternate path to avoid the erroneous 
behavior if the path leads to a successful termination of 
the program or if the merge point of the branch occurs 
after the occurrence of the erroneous behavior in the pro- 
gram (as determined by the static control flow graph). 
ConfAid unions the taint sets of all conditions that led to 
such alternate paths to derive its final result. This result is 
the set of all configuration tokens which, if altered, could 
cause the program to avoid the erroneous behavior. 

Figure 2 shows four examples that illustrate how Conf- 
Aid detects alternate paths that avoid the erroneous be- 
havior. In case (a), the error occurs after the merge point 
of the conditional branch. ConfAid determines that the 
branch does not contribute to the error, because both 
paths lead to the same erroneous behavior. In case (b), 
the alternate path avoids the erroneous behavior because 
the merge point occurs after the error, and the alternate 
path itself does not exhibit any other error. In this case, 
ConfAid considers tokens in the taint set of the branch 
condition as possible root causes of the error, since if 
the application had taken the alternate path, it could have 
avoided the error. In case (c), the alternate path leads to 
a different error (an assertion). Therefore, ConfAid does 
not consider the taint of the branch as a possible root 
cause because the alternate path would not lead to a suc- 
cessful termination. In case (d), there are two alternate 
paths, one of which leads to an assertion and one that 
reaches the merge point. In this case, since there exists 
an alternate path that avoids the erroneous behavior, con- 
figuration tokens in the taint set of the branch condition 
are possible root causes. 
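
The four cases can be expressed as a small decision procedure. The Python sketch below is a simplified model: the `Branch` and `AltPath` types, the outcome labels, and the token names A through D are hypothetical, and each alternate path is reduced to a single outcome.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AltPath:
    outcome: str  # "merge" (reaches merge point), "success", or "error"

@dataclass(frozen=True)
class Branch:
    cond_taint: frozenset    # tokens tainting the branch condition
    merge_after_error: bool  # does the merge point lie after the error?
    alt_paths: tuple         # alternate paths leaving this branch

def path_avoids_error(branch, path):
    if path.outcome == "error":
        return False          # leads to a different error
    if path.outcome == "success":
        return True           # program terminates successfully
    return branch.merge_after_error  # rejoins only after skipping the error

def possible_root_causes(branches):
    """Union the condition taint of every branch with an avoiding path."""
    result = frozenset()
    for br in branches:
        if any(path_avoids_error(br, p) for p in br.alt_paths):
            result |= br.cond_taint
    return result

case_a = Branch(frozenset({"A"}), False, (AltPath("merge"),))
case_b = Branch(frozenset({"B"}), True, (AltPath("merge"),))
case_c = Branch(frozenset({"C"}), True, (AltPath("error"),))
case_d = Branch(frozenset({"D"}), True, (AltPath("error"), AltPath("merge")))
causes = possible_root_causes([case_a, case_b, case_c, case_d])
```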

One limitation of evaluating an alternate path with 
predicate switching is that switching a predicate out- 
come, but not the underlying data values, may result in an 
“unnatural” execution that leads to erroneous behaviors, 
such as a crash due to a segmentation fault. In such cir- 
cumstances, ConfAid aborts exploration of the alternate 
path but conservatively retains the taint of the conditional

Figure 2: Examples illustrating ConfAid path analysis


branch in the possible root causes. This conservative be- 
havior may lead to false positives if the alternate path 
would in fact lead to a real error later in the execution. 
The early abort of the alternate path may also lead to 
false negatives due to unexplored variable assignments. 


3.2.1 Abstracting library functions and system calls 


There are three cases where ConfAid does not dynami- 
cally analyze information flow. The first case is when the 
application makes a system call. Since ConfAid does not 
track taint inside the operating system, the information 
flow analysis stops at the system call entry. The second
case is commonly executed standard library functions
such as malloc in libc and cryptographic functions
in libssl. ConfAid uses a primitive static analysis
for these functions to improve analysis speed while
still producing the identical effect on process taint values 
that would have been produced by a fully-instrumented 
execution. Since we abstract only functions in stan- 
dard libraries, such taint abstractions are application- 
independent. The final case is a small number of heavily 
optimized libc functions for which IDA Pro does not 
produce a complete static analysis. 

To handle these cases, ConfAid uses a taint abstraction
of the function (or system call). A taint abstraction specifies
how taint is propagated from the inputs of the function
to its outputs (e.g., return values and modified locations
in the address space). When a process calls one
of these functions, ConfAid first executes the function
without any instrumentation and then uses the taint abstraction
to modify the taints of the process memory and
registers.
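
To make the idea concrete, the Python sketch below shows what taint abstractions for two libc functions might look like. The specific rules for memcpy and strlen are assumptions made for illustration (the text does not list ConfAid's abstractions), as are the `ABSTRACTIONS` table, the helper names, and the use of "eax" as the return location.

```python
def abstract_memcpy(taint, dest, src, n):
    """Assumed rule: each copied byte inherits its source byte's taint."""
    for i in range(n):
        taint[dest + i] = taint.get(src + i, frozenset())

def abstract_strlen(taint, s, length, ret="eax"):
    """Assumed rule: the result depends on every byte scanned,
    including the terminating byte."""
    out = frozenset()
    for i in range(length + 1):
        out |= taint.get(s + i, frozenset())
    taint[ret] = out

# Hypothetical table mapping function names to their abstractions.
ABSTRACTIONS = {"memcpy": abstract_memcpy, "strlen": abstract_strlen}

def call_abstracted(taint, name, *args):
    # The real function would first run without instrumentation;
    # only the subsequent taint update is modeled here.
    ABSTRACTIONS[name](taint, *args)

taint = {100: frozenset({"FOO"}), 101: frozenset({"BAR"})}
call_abstracted(taint, "memcpy", 200, 100, 3)  # copy 3 bytes from 100 to 200
call_abstracted(taint, "strlen", 200, 2)       # 2-character string at 200
```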


3.3 Heuristics for performance 


ConfAid uses two heuristics to simplify control flow 
analysis. These heuristics eliminate exploration of some 
alternate paths to concentrate on the paths that are most 
likely to be useful in identifying the root cause. The 
heuristics reduce analysis time but also introduce false 
positives and negatives. 


3.3.1 The bounded horizon heuristic 


The first heuristic is the bounded horizon heuristic. 
ConfAid only executes each alternate path for a fixed 
number of instructions. By default, ConfAid uses a limit 
of 80 instructions. All addresses and registers modified 
within the limit are used to calculate information flow de- 
pendencies after the merge point. Locations modified af- 
ter the limit do not affect dependencies introduced at the 
merge point. If an alternate path contains further tainted 
conditional branches, ConfAid executes each path un- 
til the limit is reached. For example, if the limit is 80 
instructions and a tainted conditional branch occurs af- 
ter executing 50 instructions, both paths from the new 
branch are executed for an additional 30 instructions. 
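
The budget arithmetic in this example can be checked with a short Python model. The path encoding (lists of "write" and "branch" steps) and the `explore` function are hypothetical; in this sketch every write costs one instruction and the branch instruction itself is not charged.

```python
HORIZON = 80  # default instruction limit stated in the text

def explore(path, budget=HORIZON):
    """Return the locations modified within the instruction budget.

    A step is ("write", location) or ("branch", [subpath, ...]); at a
    tainted branch, every outgoing path receives the remaining budget.
    """
    modified = set()
    for kind, arg in path:
        if budget <= 0:
            break
        if kind == "write":
            modified.add(arg)
            budget -= 1
        else:
            for sub in arg:
                modified |= explore(sub, budget)
            break  # the rest of the path continues inside the subpaths
    return modified

prefix = [("write", f"p{i}") for i in range(50)]
long_side = [("write", f"t{i}") for i in range(40)]
short_side = [("write", f"e{i}") for i in range(10)]
modified = explore(prefix + [("branch", [long_side, short_side])])
```

With 50 instructions consumed before the nested branch, each side is explored for at most 30 more, so the last ten writes on the longer side fall beyond the horizon.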


3.3.2 The single mistake heuristic 


The second heuristic simplifies control flow analysis by 
assuming that the configuration file contains only a lim- 
ited number of erroneous tokens. By default, ConfAid 
assumes that the configuration file contains a single error 
— we refer to this as the single mistake heuristic. 

To illustrate how this simplifies path exploration, consider
again the example in Figure 1. Recall that at the
time the assert statement is executed,
$T_x = T_x \cup T_c \cup \{T_c \wedge T_a\}$. The single mistake heuristic eliminates the
last term since that term requires the values of two tokens
to change simultaneously. Similarly, ConfAid derives
$T_y = T_y \cup T_c$ during alternate path exploration. Note
that $T_y$ no longer depends upon $T_b$. This seems counterintuitive,
but for the assignment y = b to occur in the
program, a token in $T_c$ must change to cause the alternate
path to be taken. With the single mistake heuristic, a token
in $T_b$ but not in $T_c$ cannot be the root cause, since one
token in $T_c$ already must change.

More importantly, restricting the number of configuration
values that can change reduces the alternate paths
that are explored, as shown in Figure 3. The nested condition,
c2, can change only if a single configuration value
affects both c1 and c2. If $T_{c1} \cap T_{c2} = \emptyset$, then the alternate
path of c2 need not be explored at all.

if (c1 == 0) {          /* c1 set to 0 in config file */

} else {
    if (c2 == 0) {      /* c2 set to 0 also */
        x = a;
    } else {
        y = b;
    }
}


Figure 3: Example to illustrate alternate path pruning 


To implement this heuristic, we introduce a new variable,
$T_{alt}$, that is the set of configuration options that, if
changed, would cause the execution of the program to
reach the current instruction. Initially, $T_{alt}$ is the set of
all configuration tokens. At each condition, c, $T_{alt}$ does
not change along the taken path, but we set $T_{alt} = T_{alt} \cap T_c$
along the alternate path. In Figure 3, $T_{alt} = T_{c1} \cap T_{c2}$
after the second condition. When $T_{alt}$ is $\emptyset$, the alternate
path is explored no further. When a variable is assigned
along an alternate path, its taint value is set to the union
of its previous taint set and $T_{alt}$. Thus, $T_x = T_x \cup T_{c1}$ and
$T_y = T_y \cup (T_{c1} \cap T_{c2})$.
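
The bookkeeping for the Figure 3 example can be traced in Python. The sketch assumes binary taint sets; the helper names `narrow` and `assign_on_alternate` and the token names (C1, C2, SHARED, OTHER) are hypothetical.

```python
def narrow(t_alt, cond_taint):
    """Following the alternate direction of a condition keeps only the
    tokens that could flip that condition."""
    return t_alt & cond_taint

def assign_on_alternate(taint, var, t_alt):
    """A variable assigned on an alternate path adds the current set of
    path-enabling tokens to its previous taint."""
    taint[var] = taint.get(var, frozenset()) | t_alt

ALL_TOKENS = frozenset({"C1", "C2", "SHARED", "OTHER"})
t_c1 = frozenset({"C1", "SHARED"})   # taint of condition c1
t_c2 = frozenset({"C2", "SHARED"})   # taint of condition c2
taint = {}

t_alt = narrow(ALL_TOKENS, t_c1)           # else direction of c1
assign_on_alternate(taint, "x", t_alt)     # x = a: c2 keeps its taken direction
t_alt_nested = narrow(t_alt, t_c2)         # else direction of c2
if t_alt_nested:                           # explore only if some token remains
    assign_on_alternate(taint, "y", t_alt_nested)

# With disjoint condition taints, the nested alternate path is pruned.
pruned = narrow(frozenset({"C1"}), frozenset({"C2"}))
```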

The single mistake heuristic may lead to false negatives.
In Figure 3, if c1 and c2 are tainted by disjoint
sets of tokens, ConfAid will not explore the path on which
y is assigned the value of b, so it may miss the root cause if the
program later asserts based on the value of y. Potentially, 
if ConfAid cannot find a root cause, we can relax the 
single-mistake assumption by allowing ConfAid to as- 
sume that two or more tokens are erroneous. In our ex- 
periments to date, this heuristic has yet to trigger a false 
negative. 


3.4 Heuristics for reducing false positives 


We originally designed ConfAid to use only the basic 
taint tracking algorithm described in Section 3.2 with the 
bounded horizon and single mistake heuristics. However, 
our initial experiments with this design met with only 
limited success. Typically, ConfAid would include the 
root cause of a misconfiguration in its output set, yet the 
cardinality of the output set would be very large. For 
many bugs, ConfAid would return a significant fraction 
of the tokens in the configuration file. 

In analyzing our initial results, we realized that it was 
insufficient to track information flow dependencies as bi- 
nary values. In our design as described so far, two con- 
figuration tokens are considered equal taint sources even 
if one has a direct causal relationship to a location (e.g., 
the value in memory was read directly from the configuration
file) and another has a nebulous relationship (e.g.,
the taint was propagated along a long chain of conditional
assignments deep along alternate paths).

Another problem we noticed was that loops could 
cause a location to become a global source and sink for 
taint. For instance, Apache reads its configuration values 
into a linked list structure, and then traverses the list in a 
loop to find the value of a particular configuration token. 
During the traversal, the program control flow picks up 
taint from many configuration options, and these taints 
are sometimes transferred to the configuration variable 
that is the target of the search. 

We realized that both of these problems were caused 
by the implicit assumption in our design that all infor- 
mation flow relationships should be treated equally. Es- 
sentially, our design had no shades of gray: it either con- 
sidered a location to be tainted by a token or it did not. 
Based on this observation, we decided to modify our design
to instead track taint as a floating-point weight ranging
in value between zero and one. For example, the taint
of x might be represented as $\{FOO: w_{foo}, BAR: w_{bar}\}$.
As before, this set indicates that modifying either token
FOO or BAR might change the value of x. However, if
$w_{foo} > w_{bar}$, FOO has a more direct relationship to x,
and hence is believed to be a better candidate for the root
cause of an error that depends on x.

We revised ConfAid to use heuristics to weight the 
dependencies introduced by information flow differently, 
with those relationships that are more likely to lead to the 
root cause given a higher weight than those less likely to 
lead to the root cause. We also modified ConfAid to or- 
der the set of tokens on which an erroneous behavior de- 
pends by their respective weights before outputting them. 

Our weights are based on two heuristics. First, data 
flow dependencies are assumed to be more likely to lead 
to the root cause than control flow dependencies. Sec- 
ond, control flow dependencies are assumed to be more 
likely to lead to the root cause if they occur later in the 
execution (i.e., closer to the erroneous behavior). 

Specifically, we assign taints introduced by control 
flow dependencies only half the weight of taints intro- 
duced by data flow dependencies. Further, each nested 
conditional branch reduces the weight of dependencies 
introduced by prior branches in the nest by one half. We 
chose a weight of 0.5 for speed: it can be implemented 
efficiently with a vector bit shift. 

For example, in Figure 4, the assignment x = a is a
data flow dependency, so $T_x = T_a$ (any dependencies from
a are inherited at full weight). However, y inherits taint
from c1 through a control flow dependency. Thus,
$T_y = \max(T_a, \frac{1}{2} T_{c1})$. That is, we weight any taint from c1 by
half, while taint inherited from a is given full weight.
We use a special max operator here rather than a simple
union operator, since the values are now floating point
rather than binary. Specifically, $\max(T_x, T_y)$ produces a


x = a;
if (c1 == 0) {          /* c1 set to 0 in config file */
    y = a;
} else {
    z = b;
}
if (c2 == 0) {          /* c2 set to 0 in config file */
    if (c3 == 0) {      /* c3 also set to 0 */
        w = a;
    }
}


Figure 4: Example to illustrate the weighting heuristic 


set that contains all tokens that occur in either $T_x$ or $T_y$.
If a token appears in only one of $T_x$ or $T_y$, its weight is
set to its weight in that set. If a token appears in both $T_x$
and $T_y$, its weight is set to the maximum of its weight in
either set.

Similarly, $T_z = \max(T_z, \frac{1}{2} T_{c1})$ (recall that with binary
values, $T_z = T_z \cup T_{c1}$ due to the single mistake heuristic).
When ConfAid explores an alternate path, it replaces
the intersection operator with a corresponding min
operator. Thus, in the prior example from Figure 3,
$T_y = \max(T_y, \min(\frac{1}{4} T_{c1}, \frac{1}{2} T_{c2}))$.
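
The weighted operators can be modeled with dictionaries that map tokens to weights. The Python sketch below is an illustration, not ConfAid's implementation: the names `scale`, `wmax`, and `wmin` are hypothetical, and treating a token that is missing from one set as absent from the min result (weight zero) is an assumption.

```python
def scale(t, factor):
    """Multiply every weight in a taint set by `factor`."""
    return {tok: w * factor for tok, w in t.items()}

def wmax(a, b):
    """Weighted union: all tokens from either set, at the higher weight."""
    out = dict(a)
    for tok, w in b.items():
        out[tok] = max(out.get(tok, 0.0), w)
    return out

def wmin(a, b):
    """Weighted intersection: tokens in both sets, at the lower weight."""
    return {tok: min(w, b[tok]) for tok, w in a.items() if tok in b}

# y = a inside "if (c1 == 0)": full weight from a, half weight from c1.
t_a = {"A": 1.0}
t_c1 = {"C1": 1.0}
t_y = wmax(t_a, scale(t_c1, 0.5))

# A token present in both sets keeps its larger weight under max.
merged = wmax({"FOO": 0.25, "BAR": 1.0}, {"FOO": 0.5})
common = wmin({"FOO": 0.5, "BAR": 1.0}, {"FOO": 0.25})
```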

Figure 4 also shows two nested conditions. In calculating
the taint of w, condition c3 is considered more influential
than condition c2 because it occurs later in the program
execution. Therefore $T_w = \max(T_a, \frac{1}{2} T_{c3}, \frac{1}{4} T_{c2})$. The
same weighting applies to alternate path execution; assignments
on an alternate path starting at the c3 branch
are given twice the weight of those on an alternate path
starting at the c2 branch.
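
The halving rule for nested conditions can be sketched as follows. The function `control_flow_weights` and the token names are hypothetical; the factors follow the rule stated in Section 3.4 that control flow taint carries half weight and each additional level of nesting halves the weight of the enclosing branches.

```python
def control_flow_weights(nest):
    """`nest` lists the taint of each enclosing tainted condition,
    outermost first. The innermost condition contributes at weight 1/2,
    the next one out at 1/4, and so on."""
    out = {}
    depth = len(nest)
    for i, cond_taint in enumerate(nest):
        factor = 0.5 ** (depth - i)
        for tok, w in cond_taint.items():
            out[tok] = max(out.get(tok, 0.0), w * factor)
    return out

# w = a inside "if (c2 == 0) { if (c3 == 0) { ... } }" from Figure 4.
t_a = {"A": 1.0}
t_w = control_flow_weights([{"C2": 1.0}, {"C3": 1.0}])
for tok, w in t_a.items():          # data flow from a at full weight
    t_w[tok] = max(t_w.get(tok, 0.0), w)
```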

ConfAid also weights alternate paths that avoid the 
erroneous behavior by their proximity to the point in 
application execution where the behavior is exhibited. 
Paths starting from the closest tainted conditional branch 
that avoids the erroneous behavior are given full weight, 
those from the next closest branch are given half weight, 
and so on. Note that if a configuration token has a much 
stronger weight on the condition of a distant branch than 
any tokens for closer branches, ConfAid may still rank it 
as the most likely root cause. 
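
A possible reading of this ranking step is sketched below in Python. The function `rank_root_causes` and the token names are hypothetical, and combining the contributions of several branches with a per-token maximum is an assumption; the text specifies only the halving by proximity.

```python
def rank_root_causes(avoiding_branches):
    """Rank tokens given the condition taint of each tainted branch that
    has an error-avoiding alternate path, closest to the error first."""
    scores = {}
    for distance, cond_taint in enumerate(avoiding_branches):
        factor = 0.5 ** distance     # full weight, then half, and so on
        for tok, w in cond_taint.items():
            scores[tok] = max(scores.get(tok, 0.0), w * factor)
    # Highest score first; ties broken by token name for determinism.
    return sorted(scores, key=lambda tok: (-scores[tok], tok))

# Equal condition weights: the closer branch wins.
equal = rank_root_causes([{"NEAR": 1.0}, {"FAR": 1.0}])

# A much stronger weight on a distant branch can still rank first.
skewed = rank_root_causes([{"NEAR": 0.25}, {"FAR": 1.0}])
```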

Of course, when programs do not behave as expected, 
ConfAid’s heuristics may lead to incorrect results. For 
example, an application could potentially execute a sub- 
stantial amount of code between the point where the erro- 
neous behavior occurs and the point where the program 
outputs some value that exhibits the error (e.g., an error 
message). If that code contains a condition tainted by 
a configuration token other than the one that caused the 
error and that condition changes the specific error mes- 
sage that is generated, ConfAid might identify the wrong 
token as the most likely root cause. While such a
scenario is uncommon, we did observe a single Apache bug
(described further in the evaluation) in which ConfAid’s 
heuristic failed in this manner. 


3.5 Multi-process causality tracking 


The most difficult configuration errors to troubleshoot 
involve multiple interacting processes. Such processes 
may be on a single computer, or they may reside on mul- 
tiple computers connected by a network. To troubleshoot 
such cases, ConfAid instruments multiple processes at 
the same time and propagates taint information along 
with the data sent when the processes communicate. 

ConfAid supports processes that communicate using 
sockets and files. The socket support includes Unix sock- 
ets and pipes, as well as UDP and TCP sockets. Conf- 
Aid instruments the system calls that create sockets and 
pipes. It marks these objects as taint propagating chan- 
nels if the destination is another instrumented process. 
Then, ConfAid intercepts all sends and receives using 
those channels. When data is sent, ConfAid appends a 
header that indicates whether or not the data is tainted 
and, when applicable, the exact taint of the data. Taint 
information is propagated at per-byte granularity if the 
taints of different bytes of the buffer are different. On 
the receiving side, ConfAid extracts the header from the 
received data and assigns the indicated taints to the re- 
ceived data. 
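
One way to picture the header mechanism is the sketch below. The wire format (a mode byte, two lengths, and a JSON-encoded taint description placed in front of the payload) is purely an assumption for illustration; the text does not specify ConfAid's actual encoding, and the names pack_message and unpack_message are hypothetical.

```python
import json
import struct

HEADER = "!BII"  # mode, length of taint description, length of payload


def pack_message(data, taints):
    """data: bytes; taints: one taint dict per byte (empty dict = untainted)."""
    assert len(taints) == len(data)
    if not any(taints):
        mode, info = 0, None        # data is not tainted
    elif all(t == taints[0] for t in taints):
        mode, info = 1, taints[0]   # one taint covers the whole buffer
    else:
        mode, info = 2, taints      # per-byte taints differ
    blob = json.dumps(info).encode()
    return struct.pack(HEADER, mode, len(blob), len(data)) + blob + data


def unpack_message(message):
    """Strip the header and return (data, per-byte taints)."""
    mode, blob_len, data_len = struct.unpack_from(HEADER, message)
    offset = struct.calcsize(HEADER)
    info = json.loads(message[offset:offset + blob_len])
    data = message[offset + blob_len:offset + blob_len + data_len]
    if mode == 0:
        taints = [{} for _ in data]
    elif mode == 1:
        taints = [dict(info) for _ in data]
    else:
        taints = info
    return data, taints


sent = pack_message(b"abc", [{"tok": 1.0}, {}, {"tok": 0.5}])
print(unpack_message(sent))
```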

For files, ConfAid creates an auxiliary file with a spe- 
cial “.confaid” extension when an instrumented process 
writes tainted data to a file. The auxiliary file records 
which bytes in the corresponding file are tainted and 
the specific values of those taints. Like sockets, file 
taint is recorded at granularities as small as one byte. 
For instance, the file “foo.confaid” records the tainted 
bytes in file “foo”. When an instrumented process reads 
data from a file and a corresponding auxiliary file exists, 
ConfAid sets the taints of bytes read from the file to the 
values specified in the auxiliary file. 
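
The auxiliary-file scheme can be sketched as follows. Only the ".confaid" extension comes from the text; the JSON layout (byte offset mapped to taint) and the helper names write_tainted and read_tainted are assumptions for illustration. For brevity the sketch does not remove a stale auxiliary file when untainted data overwrites a file.

```python
import json
import os
import tempfile


def write_tainted(path, data, taints):
    """Write data to path; record tainted bytes in path + '.confaid'."""
    with open(path, "wb") as f:
        f.write(data)
    tainted = {str(i): t for i, t in enumerate(taints) if t}
    if tainted:
        with open(path + ".confaid", "w") as aux:
            json.dump(tainted, aux)


def read_tainted(path):
    """Read path and restore per-byte taints from the auxiliary file, if any."""
    with open(path, "rb") as f:
        data = f.read()
    tainted = {}
    if os.path.exists(path + ".confaid"):
        with open(path + ".confaid") as aux:
            tainted = json.load(aux)
    return data, [tainted.get(str(i), {}) for i in range(len(data))]


with tempfile.TemporaryDirectory() as directory:
    foo = os.path.join(directory, "foo")
    write_tainted(foo, b"ab", [{"tok": 1.0}, {}])
    aux_exists = os.path.exists(foo + ".confaid")
    restored = read_tainted(foo)
print(aux_exists, restored)
```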

Since these operations are performed by PIN instru- 
mentation immediately before and after system call exe- 
cution, the taint propagation is hidden from the applica- 
tion. No operating system modifications are needed. 


3.6 Limitations and future work 


Since configuration troubleshooting is complex, Conf- 
Aid makes a number of assumptions to simplify its anal- 
ysis. First, ConfAid only troubleshoots configuration 
problems that lead to crashes, assertion failures, and in- 
correct output; it does not yet help diagnose misconfig- 
urations that cause poor performance. One approach to 
tackling performance problems that we are investigating
is to first use statistical sampling to associate use of a
bottleneck resource such as disk or CPU with specific points
in the program execution. Then, ConfAid-style analysis 
can determine which configuration tokens most directly 
affect the frequency of execution of those points. 

Second, like previous configuration troubleshooting 
systems [38, 39], ConfAid currently assumes that the 
configuration file contains only one erroneous token. If 
fixing a particular error requires changing two tokens, 
then ConfAid’s alternate path analysis may not identify 
both tokens, as described in Section 3.3.2. However, if 
a file contains two incorrect tokens that represent inde- 
pendent mistakes, ConfAid can tackle the two errors se- 
quentially by first identifying the token that leads to the 
most immediate failure, and then identifying the other 
token once the first error is corrected. The single mis- 
take heuristic improves ConfAid’s performance by re- 
ducing the set of possible taints tracked during dynamic 
analysis. In the future, we plan to allow ConfAid to 
track sets of two or more misconfigured tokens and mea- 
sure the resulting performance overhead. Potentially, we 
may use an expanding search technique in which Conf- 
Aid initially performs an analysis assuming only a single 
mistake, and then performs a lengthier analysis allowing 
multiple mistakes if the first analysis does not yield sat- 
isfactory results. 


4 Evaluation 
Our evaluation answers the following questions: 


• How effective is ConfAid in identifying the root
cause of configuration problems?

• How long does ConfAid take to find the root cause?


4.1 Experimental setup 


We evaluated ConfAid on three applications: the 
OpenSSH server version 5.1, the Apache HTTP server 
version 2.2.14, and the Postfix mail transfer agent version 
2.7. All of our experiments were run on a Dell OptiPlex 
980 desktop computer with an Intel Core i5 Dual Core 
processor and 4GB of memory. The machine runs Linux 
kernel version 2.6.21. For Apache, ConfAid instruments 
a single process; for OpenSSH and Postfix, multiple pro- 
cesses are instrumented. 

To evaluate ConfAid, we manually injected errors into 
correct configuration files. Then, we ran a test case that 
caused the error we injected to be exhibited. We used 
ConfAid to instrument the process (or processes) for that 
application, and obtained the ordered list of root causes 
found by ConfAid. We use two metrics to evaluate
ConfAid’s effectiveness: the ranking of the actual root cause,
i.e., the injected mistake, in the list returned by ConfAid
and the time to execute the instrumented application. 

We used two different methods to generate configu- 
ration errors. First, we injected 18 real-world configu- 
ration errors that were reported in online forums, FAQ 
pages, and application documentation. Second, we used 
the ConfErr tool [25] to inject random errors into the con- 
figuration files of the three applications. 


4.2 Real-world misconfigurations 


We searched forums, FAQ pages and configuration doc- 
uments to find actual configuration problems that users 
have experienced with our target applications. In total, 
we chose 18 misconfigurations (5-7 for each application) 
that were caused by errors in the configuration files. The 
18 misconfigured values cover a range of data types, such 
as binary options, enumerated types, numerical ranges, 
and text entries such as server names. Table 1 lists the
configuration errors for each application, as well as the 
ConfAid results. 

In these experiments, ConfAid tracks dependencies 
among multiple processes for all OpenSSH and Postfix 
bugs. For OpenSSH, it instruments two processes that 
communicate via Unix sockets. For Postfix, it instru- 
ments between four and six processes that communicate 
via Unix sockets and files; the number of instrumented 
processes varies depending on how many processes are 
started before a particular bug manifests. Multi-process 
causality tracking is necessary to diagnose 4 out of 5 
Postfix and 3 out of 7 OpenSSH bugs. For Apache, Conf- 
Aid does not track dependencies across processes since 
that application starts only a single process. 

As shown in Table 1, ConfAid is highly effective in 
pinpointing the root cause of misconfigurations. Conf- 
Aid ranks the actual root cause first in 13 cases, and 
second in the other 5. Sometimes, when the actual root 
cause is ranked second, the token ranked first provides a 
valuable clue to help debug the problem. For instance, 
in Apache the actual error usually occurs nested inside a 
section or directive command in the config file. For the 
two Apache errors where the root cause is ranked second, 
the top-ranked option is the section or directive contain- 
ing the error. 

The performance of ConfAid is reasonable. The time 
to manifest the buggy behavior varies among applica- 
tions. Postfix and OpenSSH take less than 2 minutes, 
while Apache takes 2-3 minutes to complete. The
average execution time of 1:32 minutes is much faster and
less frustrating than trying to fix such configuration er- 
rors by looking at the logs, searching the Internet, and 
asking colleagues for potential clues. For instance, the 
6th Apache misconfiguration in Table 1 is taken from a
thread in linuxforums.org [28]. After trying to fix the
misconfiguration for quite a while, the user went to the
trouble of posting the question in the forum and waited 
two days for an answer. In contrast, ConfAid identified 
the root cause in less than 3 minutes. 


4.3 Effect of the weighting heuristic 


We next examine the effect of the weighting heuristic in- 
troduced in Section 3.4. For each of the 18 real-world 
misconfigurations, we disabled the heuristic and re-ran 
ConfAid. With the heuristic disabled, ConfAid treats all 
sources of information flow equally. Therefore, instead 
of producing a ranked list of possible root causes, Conf- 
Aid returns a single set of tokens, each of which is con- 
sidered equally likely to be the root cause. 

The last column of Table 1 shows the number of false
positives when the heuristic is disabled. In every case, 
ConfAid identifies the correct root cause as one of the 
returned tokens. However, the number of other tokens 
returned varies substantially. Without the heuristic, there 
were only two misconfigurations (the 6th OpenSSH bug 
and the 5th Postfix bug) for which ConfAid produces no 
false positives. For six other bugs, the number of false 
positives is relatively low (less than 6). For the remain- 
ing 10 bugs, ConfAid returns almost all options as pos- 
sible root causes. Thus, without the weighting heuristic, 
ConfAid is ineffective for 55% of the misconfigurations. 


4.4 Effects of bounded horizon heuristic 


We next investigated the effect of varying ConfAid’s 
limit on the number of instructions executed along each 
alternate path (discussed in Section 3.3.1) from the de- 
fault value of 80 instructions. As Figure 5 shows, varying 
the limit has substantially different effects on execution 
time, depending on the application being instrumented. 
For OpenSSH (bug #1), the execution time increases ap- 
proximately linearly from 56 seconds with no alternate 
path exploration to 2:29 minutes with a horizon of 1600 
instructions. On the other hand, Postfix (bug #1) shows
an apparently exponential growth as the bound increases. 
The execution time starts at 21 seconds with no alternate 
path exploration and increases to 7:10 minutes for a hori- 
zon of 800 instructions. With a horizon of 1600, ConfAid 
analysis did not complete. 

This difference in behavior derives from the nature of 
the applications. We found that even with a limit of 80 
instructions, more than 80% of the tainted conditional 
branches in the OpenSSH bug reach their merge points 
for all alternate paths. Increasing the horizon only affects 
a small fraction of the branches since the rest are short 
enough to finish within the limit. On the other hand, for 
Postfix, less than 50% of the branches reach their merge 
point within the limit of 80 instructions. As we raise the
limit, the percentage of the completed branches increases
slowly to 60%.

Table 1: Results for 18 real-world configuration bugs

Application  Bug  Description of misconfiguration

OpenSSH      1    The PermitRootLogin option is disabled. Therefore,
                  the user cannot ssh as root. The server keeps denying
                  permission although the password is entered correctly.

             2    The server only has the PasswordAuthentication option
                  enabled, while the user can only authenticate via
                  RSA keys.

             3    The user does not have his public key in the directory
                  specified in the SSH server config file. Therefore, he
                  cannot authenticate.

             4    The user is not in the AllowUsers list in the SSH
                  config file. Therefore, he cannot connect to the server
                  although he enters the password correctly.

             5    The MaxAuthTries option in SSH server config is set
                  too low. Therefore, the user is disconnected if she
                  enters her password incorrectly once.

             6    The MaxStartups option is set to 1. Therefore, the
                  server refuses to start a new session, while another
                  unauthenticated session is still in progress.

             7    The location of the server RSA key is not set correctly
                  in the config file. Therefore, the client fails to verify
                  the host key.

Apache       1    The path specified in the DocumentRoot option does
HTTP Server       not have a corresponding <Directory> section.
                  Therefore, all accesses to this path are denied
                  according to the default policy.

             2    The cgi-bin directory is ScriptAliased in the config
                  file. This prevents the DirectoryIndex from working as
                  expected. Therefore, the user cannot access the index
                  file in the directory.

             3    The cgi-bin directory is aliased in the config file.
                  However, the corresponding Directory section does not
                  provide sufficient permissions. Therefore, accesses to
                  this directory are denied.

             4    A virtual host with the same interface coverage is set
                  for the HTTP server. This host points to a different
                  DocumentRoot which overwrites the default one.
                  Therefore, the user gets an index file with incorrect
                  content upon accessing the server DocumentRoot.

             5    The cgi-bin directory is aliased and a CGI Handler is
                  activated in the config file. However, the
                  corresponding <Directory> section does not have the
                  ExecCGI option set. The user cannot access the
                  executables in this directory.

             6    A specific directory in DocumentRoot is also aliased to
                  another directory outside DocumentRoot. Therefore,
                  accesses to files in the first directory are redirected
                  to the aliased directory, and the files are not found.

Postfix      1    The mydestination option is not set correctly in the
                  Postfix config file. Therefore, Postfix cannot deliver
                  mail locally.

             2    The myorigin option is set incorrectly in the Postfix
                  config file. Therefore, the next relay host bounces the
                  mail sent from the user’s machine to the Internet.

             3    The relayhost option is set incorrectly. Therefore,
                  Postfix cannot forward the email sent from the user’s
                  machine to the Internet.

             4    The type of alias_maps option is not supported in the
                  user’s machine. Therefore, Postfix fails to send any
                  mail locally or to the Internet.

             5    The email address provided in luser_relay is not
                  reachable. Therefore, Postfix cannot redirect other
                  mail with wrong recipient to the luser_relay.

Figure 5: The effect of varying the horizon

To summarize, we found that there is no single limit 
that works best for all applications. Consequently, we 
envision that we could augment ConfAid to use an iter- 
ative search process in which it would start with a small 
horizon to generate results quickly, then continue to exe- 
cute with larger horizons to refine the results. 
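
The envisioned iterative search could look roughly like the sketch below. It is not part of ConfAid: run_analysis stands for a hypothetical function that performs one analysis with a given horizon and returns a ranked token list, and the intermediate horizon values 200 and 400 are assumptions (the text only names 80, 800, and 1600).

```python
def refine(run_analysis, horizons=(80, 200, 400, 800, 1600)):
    """Yield (horizon, ranking) pairs, starting with the smallest horizon.

    Early results arrive quickly; later ones refine them.
    """
    for horizon in horizons:
        yield horizon, run_analysis(horizon)


def fake_analysis(horizon):
    # Stand-in for a real analysis run; the rankings are made up.
    if horizon < 400:
        return ["token_b", "token_a"]
    return ["token_a", "token_b"]


results = list(refine(fake_analysis))
for horizon, ranking in results:
    print(horizon, ranking)
```

A caller can stop consuming results as soon as the ranking looks satisfactory, so the cost of the larger horizons is paid only when needed.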


4.5 Random fault injection 


We next used ConfErr [25] to randomly generate config- 
uration errors. ConfErr uses human error models rooted 
in psychology and linguistics to generate realistic config- 
uration mistakes. We used ConfErr to produce 20 errors 
for each application. We then injected the errors one by 
one and measured the effectiveness and performance of 
ConfAid. 

As shown in Table 2, ConfAid performs very well on 
these errors. The average time to execute all three appli- 
cations is lower than the average execution time for the 
real-world errors used in the previous section. The main 
reason for this difference is that the real-world errors are 
often more complex than the randomly-generated ones. 
Therefore, it takes more time for the application to man- 
ifest the buggy behavior for real-world errors. 

For the randomly generated errors, ConfAid instru- 
ments up to two processes for OpenSSH and up to six 
processes for Postfix. However, many faults are exhib- 
ited before these applications start additional processes; 
in such cases, ConfAid only instruments one process. 

For OpenSSH, ConfAid successfully pinpointed the 
root cause (where we define success as listing the actual 
root cause as one of the top two options) for 95% of the 
bugs. For the last bug, ConfAid could not run to com- 
pletion due to unsupported system calls used in the code 
path. We could remedy this by abstracting more calls. 


ConfAid also successfully diagnoses 95% of the 
Apache errors. For the remaining error, ConfAid ranks 
the root cause 9th. The configuration error is that the 
DirectoryIndex file for the main document root is listed 
incorrectly in the Apache configuration file. The Directo- 
ryIndex file is the file that Apache serves if that directory 
is accessed without mentioning a specific file. For in- 
stance, accessing http://server.com/images/ will 
return the DirectoryIndex file listed for the images
directory. However, the Indexes option is also activated for
the document root directory. This option allows Apache 
to send the list of the files in the directory if no specific 
file in that directory is requested. The combination of 
these two options causes Apache to serve the list of files 
in the main document directory instead of the index file. 
ConfAid determines that the content sent to the user is 
dependent on the Indexes and related options first and 
the DirectoryIndex option next. Thus, the root cause gets 
ranked lower in the list. This ordering is a direct result 
of the heuristic discussed in Section 3.4 that considers 
branches closer to the erroneous behavior to be more 
likely to lead to the root cause than those farther away. 

For Postfix, ConfAid diagnoses 85% of the errors ef- 
fectively. The remaining 3 errors are due to missing 
configuration options. Currently, ConfAid considers
only tokens present in the configuration file as possible
sources of the root cause. If a default value can be
overridden by a token not actually in the file, then Conf- 
Aid will not detect the missing token as a possible root 
cause. Based on these results, we plan to extend our al- 
ternate path analysis to look for tokens that could be read 
from the config file along branches that are not actually 
executed. We can taint variables modified along those 
branches with a value that is dependent upon the branch 
conditions that led to that path. 

Overall, ConfAid successfully diagnosed 55 out of 60 
random errors by ranking the actual root cause first or 
second. Out of the remaining 5 errors, we believe that 4 
(the OpenSSH server error and the three Postfix errors) 
can be diagnosed with further improvements to the Conf- 
Aid implementation. The remaining error (the Apache 
error) is a direct result of our weighting heuristic and 
seems hard for ConfAid to diagnose correctly. 
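
The overall figure follows from the per-application success rates reported above (20 injected errors per application). The short check below only restates that arithmetic; the variable names are ours.

```python
ERRORS_PER_APP = 20

# Success rates (percent) reported in the text for each application.
success_percent = {"OpenSSH": 95, "Apache": 95, "Postfix": 85}

diagnosed = {app: pct * ERRORS_PER_APP // 100 for app, pct in success_percent.items()}
total_diagnosed = sum(diagnosed.values())
total_errors = ERRORS_PER_APP * len(success_percent)

print(diagnosed, total_diagnosed, total_errors - total_diagnosed)
```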


5 Related work 


Several prior research efforts have applied different tech- 
niques to the problem of configuration troubleshooting. 
Unlike ConfAid, most prior systems have taken a black 
box approach that uses only state external to the applica- 
tion being debugged to infer the problem. 

PeerPressure [38] and its predecessor, Strider [39], use 
statistical methods to compare configuration state in the 
Windows registry on different machines. When a value

Application | Root causes ranked first | Root causes ranked first with one tie | Root causes ranked second | Root causes ranked second with one tie | Root causes ranked worse than second | Avg. time to run

Table 2: Random fault injection results 


on a machine exhibiting erroneous behavior differs from 
the value usually chosen by other machines, PeerPres- 
sure flags the value as a potential error. This approach 
works well as long as the majority configuration is ap- 
propriate for the target machine; however, PeerPressure 
and Strider cannot separate custom configuration vari- 
ables from erroneous ones since they do not observe how 
applications actually use those values. In contrast, Conf- 
Aid can differentiate these cases by observing how the 
values are used inside the application binary. 

Chronus [40] also compares multiple configuration 
states. Instead of comparing states across computers, it
uses virtual machine checkpoint and rollback to “time 
travel” through states on the same machine, looking for 
the instance in which the program behavior on a particu- 
lar test case switched from correct to incorrect. 

Other projects monitor state external to applications 
as they run. Cohen et al. [15] use statistical techniques 
to help troubleshoot performance issues by correlating 
those issues with low-level performance statistics for the 
CPU, disk, and other system components. AutoBash [37] 
traces causality inside the OS by monitoring system call 
execution, but treats execution inside each process as a 
black box. AutoBash can suggest that a particular con- 
figuration file may be erroneous, but it cannot identify 
the specific value within the file that is at fault. 

Our previous work on misconfiguration diagnosis [4] 
uses the application’s system call trace to extract the files 
and processes on which the application causally depends. 
It then generates a signature based on those dependencies 
to represent the misconfiguration and search for the sig- 
nature in a database of known bugs. Clarify [21] uses 
similar execution signatures to improve error reporting. 
It uses program features such as function call counts, call 
sites, and stack dumps to generate the signatures. The 
improved error reporting, although helpful, does not di- 
agnose the root cause. 

In contrast to all these projects, ConfAid takes a white 
box approach to configuration troubleshooting by moni- 
toring causality within the program binary as it executes. 
Thus, ConfAid can observe the actual dependencies as 
they are introduced rather than inferring them through 
statistical and other methods. 

Two recent systems apply white box analysis to a
related problem: helping developers replicate a problem
experienced in the field. SherLog [43] and ESD [44]
both use static analysis and symbolic execution to infer 
the execution path of the application. SherLog uses log 
messages and ESD leverages the bug report generated by 
the application to constrain the execution path. Both of 
these systems can replicate an execution path that derives 
from a misconfiguration. However, they make different 
design decisions than ConfAid, driven by their differ- 
ent use case. They both require application source code, 
and SherLog also may require guidance from developers 
about which functions should be symbolically executed. 
This is appropriate for a tool used by software experts, 
but less so for one like ConfAid that is targeted at ad- 
ministrators and users. More generally, symbolic execu- 
tion systems have been applied to model checking file 
systems and other complex software systems [9, 41]. 


A number of systems trace causality external to pro- 
cesses to debug configuration and performance issues in 
distributed systems. For example, the work of Aguilera 
et al. [2] and Magpie [5, 6] trace RPCs and other net- 
work communication to debug performance problems. 
Causeway [11] allows applications to inject metadata 
that follows causal paths for distributed applications. 
Pinpoint [13] traces middleware and communications be- 
tween components in a distributed system and statis- 
tically correlates traces with success and failure data. 
Follow-on work to Pinpoint [12] adds the abstraction of 
causal paths that link black-box components. ConfAid 
and these systems share the common idea of propagat- 
ing causal information among distributed components; 
however, ConfAid also propagates causal information 
within processes, which allows it to precisely determine 
the causal relationships between inputs and outputs. 


More generally, many systems reason about causal in- 
teractions in the operating system and in distributed sys- 
tems. For example, taint tracking [34] monitors data flow 
dependencies to determine when input data is used in 
an insecure manner. ConfAid uses the same approach 
for data flow analysis, but applies it to a different do- 
main. Dytan [14] proposes a generic dynamic taint anal- 
ysis framework to ease the implementation of various 
taint-based techniques. ConfAid enhances basic
dynamic taint analysis with essential heuristics and
applies it to the configuration troubleshooting problem.
RedFlag [17] uses data flow analysis to reduce leaks of
sensitive information by personal machines. Resin [42]
uses application-level data flow assertions to improve 
the security of applications. Decentralized informa- 
tion flow [32, 45] monitors both control flow and data 
flow dependencies to determine if a code component 
leaks information that it is not authorized to divulge. 
BackTracker [26] traces causal interactions to determine 
what state has been changed during an intrusion. As- 
bestos [19] and HiStar [46] monitor causality in the OS 
to prevent inadvertent disclosure of private data. 

Program slicing [1, 48, 47], intended to aid in debug- 
ging, is a more general approach that determines which 
statements could affect the value of a variable using
backward or forward computations. ConfAid applies
similar data and control flow analysis techniques to a new 
problem, namely determining the root causes of miscon- 
figurations. 


6 Conclusion 


Configuration errors are costly, time-consuming, and 
frustrating to troubleshoot. ConfAid makes
troubleshooting easier by pinpointing the specific token in a
configuration file that led to an erroneous behavior. Com- 
pared to prior approaches, ConfAid distinguishes itself 
by analyzing causality within processes as they execute 
without the need for application source code. It propa- 
gates causal dependencies among multiple processes and 
outputs a ranked list of probable root causes. Our results 
show that ConfAid usually lists the actual root cause as 
the first or second entry in this list. Thus, ConfAid can 
substantially reduce total time to recovery and perhaps 
make configuration problems a little less frustrating.

Abstract


Updating an index of the web as documents are 
crawled requires continuously transforming a large 
repository of existing documents as new documents ar- 
rive. This task is one example of a class of data pro- 
cessing tasks that transform a large repository of data 
via small, independent mutations. These tasks lie in a 
gap between the capabilities of existing infrastructure. 
Databases do not meet the storage or throughput require- 
ments of these tasks: Google’s indexing system stores 
tens of petabytes of data and processes billions of up- 
dates per day on thousands of machines. MapReduce and 
other batch-processing systems cannot process small up- 
dates individually as they rely on creating large batches 
for efficiency. 

We have built Percolator, a system for incrementally 
processing updates to a large data set, and deployed it 
to create the Google web search index. By replacing a 
batch-based indexing system with an indexing system 
based on incremental processing using Percolator, we 
process the same number of documents per day, while 
reducing the average age of documents in Google search 
results by 50%. 


1 Introduction


Consider the task of building an index of the web that 
can be used to answer search queries. The indexing sys- 
tem starts by crawling every page on the web and pro- 
cessing them while maintaining a set of invariants on the 
index. For example, if the same content is crawled un- 
der multiple URLs, only the URL with the highest Page- 
Rank [28] appears in the index. Each link is also inverted 
so that the anchor text from each outgoing link is at- 
tached to the page the link points to. Link inversion must 
work across duplicates: links to a duplicate of a page 
should be forwarded to the highest PageRank duplicate 
if necessary. 
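
The two invariants can be made concrete with a toy example. The sketch below is only an illustration under simplifying assumptions: pages are held in an in-memory dictionary, duplicates are detected by exact content equality, and the function name build_index and the sample URLs are hypothetical.

```python
def build_index(pages):
    """pages: url -> {"content": str, "pagerank": float,
                      "links": [(target_url, anchor_text), ...]}

    Returns canonical_url -> list of anchor texts pointing at it.
    """
    # Cluster duplicates: per distinct content, keep the highest-PageRank URL.
    canonical = {}
    for url, page in pages.items():
        best = canonical.get(page["content"])
        if best is None or page["pagerank"] > pages[best]["pagerank"]:
            canonical[page["content"]] = url

    # Invert links; a link to a duplicate is forwarded to the canonical URL.
    anchors = {url: [] for url in canonical.values()}
    for page in pages.values():
        for target, text in page["links"]:
            if target in pages:
                anchors[canonical[pages[target]["content"]]].append(text)
    return anchors


pages = {
    "a.example": {"content": "X", "pagerank": 0.9, "links": []},
    "b.example": {"content": "X", "pagerank": 0.1, "links": []},
    "c.example": {"content": "Y", "pagerank": 0.5,
                  "links": [("b.example", "mirror")]},
}
index = build_index(pages)
print(index)
```

Here the link to b.example is credited to a.example, the higher-PageRank duplicate, and b.example itself does not appear in the index.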

This is a bulk-processing task that can be expressed 
as a series of MapReduce [13] operations: one for clus- 
tering duplicates, one for link inversion, etc. It’s easy to 
maintain invariants since MapReduce limits the
parallelism of the computation; all documents finish one
processing step before starting the next. For example, when
the indexing system is writing inverted links to the cur- 
rent highest-PageRank URL, we need not worry about 
its PageRank concurrently changing; a previous MapRe- 
duce step has already determined its PageRank. 

Now, consider how to update that index after recrawl- 
ing some small portion of the web. It’s not sufficient to 
run the MapReduces over just the new pages since, for 
example, there are links between the new pages and the 
rest of the web. The MapReduces must be run again over 
the entire repository, that is, over both the new pages 
and the old pages. Given enough computing resources, 
MapReduce’s scalability makes this approach feasible, 
and, in fact, Google’s web search index was produced 
in this way prior to the work described here. However, 
reprocessing the entire web discards the work done in 
earlier runs and makes latency proportional to the size of 
the repository, rather than the size of an update. 

The indexing system could store the repository in a 
DBMS and update individual documents while using 
transactions to maintain invariants. However, existing 
DBMSs can’t handle the sheer volume of data: Google’s 
indexing system stores tens of petabytes across thou- 
sands of machines [30]. Distributed storage systems like 
Bigtable [9] can scale to the size of our repository but 
don’t provide tools to help programmers maintain data 
invariants in the face of concurrent updates. 

An ideal data processing system for the task of main- 
taining the web search index would be optimized for in- 
cremental processing; that is, it would allow us to main- 
tain a very large repository of documents and update it 
efficiently as each new document was crawled. Given 
that the system will be processing many small updates 
concurrently, an ideal system would also provide mech- 
anisms for maintaining invariants despite concurrent up- 
dates and for keeping track of which updates have been 
processed. 

Figure 1: Percolator and its dependencies

The remainder of this paper describes a particular
incremental processing system: Percolator. Percolator
provides the user with random access to a multi-PB
repository. Random access allows us to process documents
individually, avoiding the global scans of the repository
that MapReduce requires. To achieve high throughput, 
many threads on many machines need to transform the 
repository concurrently, so Percolator provides ACID- 
compliant transactions to make it easier for programmers 
to reason about the state of the repository; we currently 
implement snapshot isolation semantics [5]. 

In addition to reasoning about concurrency, program- 
mers of an incremental system need to keep track of the 
state of the incremental computation. To assist them in 
this task, Percolator provides observers: pieces of code 
that are invoked by the system whenever a user-specified 
column changes. Percolator applications are structured 
as a series of observers; each observer completes a task 
and creates more work for “downstream” observers by 
writing to the table. An external process triggers the first 
observer in the chain by writing initial data into the table. 
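
The observer structure can be pictured with a toy, single-process sketch. This is not Percolator's API: the Table class, its methods, and the column names are hypothetical, and the sketch omits transactions, distribution, and persistence entirely. It only shows how a write to an observed column creates work for a downstream observer.

```python
from collections import deque


class Table:
    """Toy table whose observed columns trigger callbacks when written."""

    def __init__(self):
        self.cells = {}         # (row, column) -> value
        self.observers = {}     # column -> callback(table, row, value)
        self.pending = deque()  # notifications: (row, column)

    def observe(self, column, callback):
        self.observers[column] = callback

    def write(self, row, column, value):
        self.cells[(row, column)] = value
        if column in self.observers:
            self.pending.append((row, column))

    def run(self):
        # Invoke observers until no notifications remain.
        while self.pending:
            row, column = self.pending.popleft()
            self.observers[column](self, row, self.cells[(row, column)])


table = Table()
# First observer: normalize raw documents, creating work downstream.
table.observe("raw", lambda t, row, v: t.write(row, "parsed", v.lower()))
# Second observer: derive a value from the parsed document.
table.observe("parsed", lambda t, row, v: t.write(row, "word_count", len(v.split())))

# An external process triggers the chain by writing initial data.
table.write("example.com", "raw", "Hello World")
table.run()
print(table.cells)
```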

Percolator was built specifically for incremental pro- 
cessing and is not intended to supplant existing solutions 
for most data processing tasks. Computations where the 
result can’t be broken down into small updates (sorting 
a file, for example) are better handled by MapReduce. 
Also, the computation should have strong consistency 
requirements; otherwise, Bigtable is sufficient. Finally, 
the computation should be very large in some dimen- 
sion (total data size, CPU required for transformation, 
etc.); smaller computations not suited to MapReduce or 
Bigtable can be handled by traditional DBMSs. 

Within Google, the primary application of Percola- 
tor is preparing web pages for inclusion in the live web 
search index. By converting the indexing system to an 
incremental system, we are able to process individual 
documents as they are crawled. This reduced the aver- 
age document processing latency by a factor of 100, and 
the average age of a document appearing in a search re- 
sult dropped by nearly 50 percent (the age of a search re- 
sult includes delays other than indexing such as the time 
between a document being changed and being crawled). 
The system has also been used to render pages into 
images; Percolator tracks the relationship between web 
pages and the resources they depend on, so pages can be 
reprocessed when any depended-upon resources change. 


2 Design 


Percolator provides two main abstractions for per- 
forming incremental processing at large scale: ACID 
transactions over a random-access repository and
observers, a way to organize an incremental computation.

A Percolator system consists of three binaries that run 
on every machine in the cluster: a Percolator worker, a 
Bigtable [9] tablet server, and a GFS [20] chunkserver. 
All observers are linked into the Percolator worker, 
which scans the Bigtable for changed columns (“noti- 
fications”) and invokes the corresponding observers as 
a function call in the worker process. The observers 
perform transactions by sending read/write RPCs to 
Bigtable tablet servers, which in turn send read/write 
RPCs to GFS chunkservers. The system also depends 
on two small services: the timestamp oracle and the 
lightweight lock service. The timestamp oracle pro- 
vides strictly increasing timestamps: a property required 
for correct operation of the snapshot isolation protocol. 
Workers use the lightweight lock service to make the 
search for dirty notifications more efficient. 

From the programmer’s perspective, a Percolator 
repository consists of a small number of tables. Each 
table is a collection of “cells” indexed by row and col- 
umn. Each cell contains a value: an uninterpreted array of 
bytes. (Internally, to support snapshot isolation, we rep- 
resent each cell as a series of values indexed by times- 
tamp.) 
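
The timestamp-indexed representation of a cell can be pictured with a small sketch. It is an illustration only: `VersionedCell`, `Write`, and `ReadAt` are hypothetical names, not Percolator or Bigtable APIs, and an in-memory `std::map` stands in for Bigtable's timestamp dimension. The sketch assumes C++17.

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical model of one cell: a series of values indexed by timestamp.
class VersionedCell {
 public:
  // Records `value` as the version written at timestamp `ts`.
  void Write(std::int64_t ts, std::string value) {
    versions_[ts] = std::move(value);
  }

  // Returns the newest version with timestamp <= ts, or nullopt if the
  // cell had no value yet at that time.
  std::optional<std::string> ReadAt(std::int64_t ts) const {
    auto it = versions_.upper_bound(ts);  // first version newer than ts
    if (it == versions_.begin()) return std::nullopt;
    return std::prev(it)->second;
  }

 private:
  std::map<std::int64_t, std::string> versions_;  // timestamp -> value
};
```

A reader holding a snapshot timestamp calls `ReadAt` with it and is unaffected by versions written later.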

The design of Percolator was influenced by the re- 
quirement to run at massive scales and the lack of a 
requirement for extremely low latency. Relaxed latency 
requirements let us take, for example, a lazy approach 
to cleaning up locks left behind by transactions running 
on failed machines. This lazy, simple-to-implement ap- 
proach potentially delays transaction commit by tens of 
seconds. This delay would not be acceptable in a DBMS 
running OLTP tasks, but it is tolerable in an incremental 
processing system building an index of the web. Percola- 
tor has no central location for transaction management; 
in particular, it lacks a global deadlock detector. This in- 
creases the latency of conflicting transactions but allows 
the system to scale to thousands of machines. 


2.1 Bigtable overview 


Percolator is built on top of the Bigtable distributed 
storage system. Bigtable presents a multi-dimensional 
sorted map to users: keys are (row, column, times- 
tamp) tuples. Bigtable provides lookup and update oper- 
ations on each row, and Bigtable row transactions enable 
atomic read-modify-write operations on individual rows. 
Bigtable handles petabytes of data and runs reliably on 
large numbers of (unreliable) machines. 

A running Bigtable consists of a collection of tablet 
servers, each of which is responsible for serving several 
tablets (contiguous regions of the key space). A master 
coordinates the operation of tablet servers by, for exam- 
ple, directing them to load or unload tablets. A tablet is 
stored as a collection of read-only files in the Google 
SSTable format. SSTables are stored in GFS; Bigtable
relies on GFS to preserve data in the event of disk loss. 
Bigtable allows users to control the performance charac- 
teristics of the table by grouping a set of columns into 
a locality group. The columns in each locality group are 
stored in their own set of SSTables, which makes scan- 
ning them less expensive since the data in other columns 
need not be scanned. 


The decision to build on Bigtable defined the over- 
all shape of Percolator. Percolator maintains the gist of 
Bigtable’s interface: data is organized into Bigtable rows 
and columns, with Percolator metadata stored along- 
side in special columns (see Figure 5). Percolator’s 
API closely resembles Bigtable’s API: the Percolator li- 
brary largely consists of Bigtable operations wrapped in 
Percolator-specific computation. The challenge, then, in 
implementing Percolator is providing the features that 
Bigtable does not: multirow transactions and the ob- 
server framework. 


2.2 Transactions 


Percolator provides cross-row, cross-table transac- 
tions with ACID snapshot-isolation semantics. Percola- 
tor users write their transaction code in an imperative 
language (currently C++) and mix calls to the Percola- 
tor API with their code. Figure 2 shows a simplified ver- 
sion of clustering documents by a hash of their contents. 
In this example, if Commit() returns false, the transac- 
tion has conflicted (in this case, because two URLs with 
the same content hash were processed simultaneously) 
and should be retried after a backoff. Calls to Get() and
Commit() are blocking; parallelism is achieved by run- 
ning many transactions simultaneously in a thread pool. 


While it is possible to incrementally process data with- 
out the benefit of strong transactions, transactions make 
it more tractable for the user to reason about the state of 
the system and to avoid the introduction of errors into 
a long-lived repository. For example, in a transactional 
web-indexing system the programmer can make assump- 
tions like: the hash of the contents of a document is al- 
ways consistent with the table that indexes duplicates. 
Without transactions, an ill-timed crash could result in a 
permanent error: an entry in the document table that cor- 
responds to no URL in the duplicates table. Transactions 
also make it easy to build index tables that are always 
up to date and consistent. Note that both of these exam- 
ples require transactions that span rows, rather than the 
single-row transactions that Bigtable already provides. 

bool UpdateDocument(Document doc) {
  Transaction t(&cluster);
  t.Set(doc.url(), "contents", "document", doc.contents());
  int hash = Hash(doc.contents());

  // dups table maps hash -> canonical URL
  string canonical;
  if (!t.Get(hash, "canonical-url", "dups", &canonical)) {
    // No canonical yet; write myself in
    t.Set(hash, "canonical-url", "dups", doc.url());
  }  // else this document already exists, ignore new copy
  return t.Commit();
}

Figure 2: Example usage of the Percolator API to perform
basic checksum clustering and eliminate documents with the
same content.

Figure 3: Transactions under snapshot isolation perform reads
at a start timestamp (represented here by an open square) and
writes at a commit timestamp (closed circle). In this example,
transaction 2 would not see writes from transaction 1 since
transaction 2’s start timestamp is before transaction 1’s
commit timestamp. Transaction 3, however, will see writes from
both 1 and 2. Transactions 1 and 2 are running concurrently: if
they both write the same cell, at least one will abort.

Percolator stores multiple versions of each data item
using Bigtable’s timestamp dimension. Multiple versions
are required to provide snapshot isolation [5], which
presents each transaction with the appearance of reading
from a stable snapshot at some timestamp. Writes appear
at a different, later timestamp. Snapshot isolation
protects against write-write conflicts: if transactions A and
B, running concurrently, write to the same cell, at most
one will commit. Snapshot isolation does not provide
serializability; in particular, transactions running under
snapshot isolation are subject to write skew [5]. The main
advantage of snapshot isolation over a serializable
protocol is more efficient reads. Because any timestamp
represents a consistent snapshot, reading a cell requires only
performing a Bigtable lookup at the given timestamp;
acquiring locks is not necessary. Figure 3 illustrates the
relationship between transactions under snapshot isolation.
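
The visibility and conflict rules of snapshot isolation can be stated as a few predicates over start and commit timestamps. The following is a minimal sketch under stated assumptions: `TxnTimes`, `Sees`, `RanConcurrently`, and `MustConflict` are hypothetical names introduced here, and timestamps are assumed to be unique, as they are when issued by a strictly increasing oracle.

```cpp
#include <cstdint>

// Hypothetical record of one transaction's two timestamps.
struct TxnTimes {
  std::int64_t start_ts;   // reads observe the snapshot at this timestamp
  std::int64_t commit_ts;  // writes become visible at this timestamp
};

// A reader sees a writer's writes only if the writer committed before the
// reader's snapshot was taken.
bool Sees(const TxnTimes& reader, const TxnTimes& writer) {
  return writer.commit_ts < reader.start_ts;
}

// Two transactions ran concurrently if their [start, commit] intervals
// overlap.
bool RanConcurrently(const TxnTimes& a, const TxnTimes& b) {
  return a.start_ts < b.commit_ts && b.start_ts < a.commit_ts;
}

// Write-write conflict: concurrent transactions that write the same cell
// cannot both commit.
bool MustConflict(const TxnTimes& a, const TxnTimes& b, bool same_cell) {
  return same_cell && RanConcurrently(a, b);
}
```

With made-up timestamps shaped like the example in Figure 3 (transaction 1 running over [1, 4], transaction 2 over [2, 5], transaction 3 starting at 6), transaction 2 does not see transaction 1, transaction 3 sees both, and transactions 1 and 2 conflict if they write the same cell.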

Because it is built as a client library accessing 
Bigtable, rather than controlling access to storage itself, 
Percolator faces a different set of challenges implement- 
ing distributed transactions than traditional PDBMSs. 
Other parallel databases integrate locking into the sys- 
tem component that manages access to the disk: since 
each node already mediates access to data on the disk it 
can grant locks on requests and deny accesses that violate 
locking requirements. 

By contrast, any node in Percolator can (and does) issue
requests to directly modify state in Bigtable: there is
no convenient place to intercept traffic and assign locks.
As a result, Percolator must explicitly maintain locks.
Locks must persist in the face of machine failure; if a
lock could disappear between the two phases of commit,
the system could mistakenly commit two transactions
that should have conflicted. The lock service must
provide high throughput; thousands of machines will be
requesting locks simultaneously. The lock service should
also be low-latency; each Get() operation requires reading
locks in addition to data, and we prefer to minimize
this latency. Given these requirements, the lock server
will need to be replicated (to survive failure), distributed
and balanced (to handle load), and write to a persistent
data store. Bigtable itself satisfies all of our requirements,
and so Percolator stores its locks in special in-memory
columns in the same Bigtable that stores data and reads
or modifies the locks in a Bigtable row transaction when
accessing data in that row.

key   bal:data   bal:lock   bal:write
Bob   6:         6:         6: data @ 5
      5: $10     5:         5:
Joe   6:         6:         6: data @ 5
      5: $2      5:         5:

1. Initial state: Joe’s account contains $2 dollars, Bob’s $10.

key   bal:data   bal:lock          bal:write
Bob   7: $3      7: I am primary   7:
      6:         6:                6: data @ 5
      5: $10     5:                5:
Joe   6:         6:                6: data @ 5
      5: $2      5:                5:

2. The transfer transaction begins by locking Bob’s account
balance by writing the lock column. This lock is the primary
for the transaction. The transaction also writes data at its start
timestamp, 7.

key   bal:data   bal:lock               bal:write
Bob   7: $3      7: I am primary        7:
      6:         6:                     6: data @ 5
      5: $10     5:                     5:
Joe   7: $9      7: primary @ Bob.bal   7:
      6:         6:                     6: data @ 5
      5: $2      5:                     5:

3. The transaction now locks Joe’s account and writes Joe’s new
balance (again, at the start timestamp). The lock is a secondary
for the transaction and contains a reference to the primary lock
(stored in row “Bob,” column “bal”); in case this lock is stranded
due to a crash, a transaction that wishes to clean up the lock
needs the location of the primary to synchronize the cleanup.

key   bal:data   bal:lock               bal:write
Bob   8:         8:                     8: data @ 7
      7: $3      7:                     7:
      6:         6:                     6: data @ 5
      5: $10     5:                     5:
Joe   7: $9      7: primary @ Bob.bal   7:
      6:         6:                     6: data @ 5
      5: $2      5:                     5:

4. The transaction has now reached the commit point: it erases
the primary lock and replaces it with a write record at a new
timestamp (called the commit timestamp): 8. The write record
contains a pointer to the timestamp where the data is stored.
Future readers of the column “bal” in row “Bob” will now see the
value $3.

key   bal:data   bal:lock   bal:write
Bob   8:         8:         8: data @ 7
      7: $3      7:         7:
      6:         6:         6: data @ 5
      5: $10     5:         5:
Joe   8:         8:         8: data @ 7
      7: $9      7:         7:
      6:         6:         6: data @ 5
      5: $2      5:         5:

5. The transaction completes by adding write records and
deleting locks at the secondary cells. In this case, there is only
one secondary: Joe.

Figure 4: This figure shows the Bigtable writes performed by
a Percolator transaction that mutates two rows. The transaction
transfers 7 dollars from Bob to Joe. Each Percolator column is
stored as 3 Bigtable columns: data, write metadata, and lock
metadata. Bigtable’s timestamp dimension is shown within each
cell; 12: “data” indicates that “data” has been written at Bigtable
timestamp 12. Newly written data is shown in boldface.

Column     Use
c:lock     An uncommitted transaction is writing this cell;
           contains the location of primary lock
c:write    Committed data present; stores the Bigtable
           timestamp of the data
c:data     Stores the data itself
c:notify   Hint: observers may need to run
c:ack_O    Observer “O” has run; stores start timestamp
           of successful last run

Figure 5: The columns in the Bigtable representation of a
Percolator column named “c.”

We’ll now consider the transaction protocol in more 
detail. Figure 6 shows the pseudocode for Percolator 
transactions, and Figure 4 shows the layout of Percolator 
data and metadata during the execution of a transaction. 
These various metadata columns used by the system are 
described in Figure 5. The transaction’s constructor asks 
the timestamp oracle for a start timestamp (line 6), which 
determines the consistent snapshot seen by Get(). Calls 
to Set() are buffered (line 7) until commit time. The basic
approach for committing buffered writes is two-phase
commit, which is coordinated by the client. Transactions
on different machines interact through row transactions 
on Bigtable tablet servers. 

In the first phase of commit (“prewrite”), we try to
lock all the cells being written. (To handle client failure, 
we designate one lock arbitrarily as the primary; we’ll 
discuss this mechanism below.) The transaction reads 
metadata to check for conflicts in each cell being writ- 
ten. There are two kinds of conflicting metadata: if the 
transaction sees another write record after its start times- 
tamp, it aborts (line 32); this is the write-write conflict 
that snapshot isolation guards against. If the transaction 
sees another lock at any timestamp, it also aborts (line 
34). It’s possible that the other transaction is just being 
slow to release its lock after having already committed 
below our start timestamp, but we consider this unlikely, 
so we abort. If there is no conflict, we write the lock and
the data to each cell at the start timestamp (lines 36-38).

 1 class Transaction {
 2   struct Write { Row row; Column col; string value; };
 3   vector<Write> writes_;
 4   int start_ts_;
 5
 6   Transaction() : start_ts_(oracle.GetTimestamp()) {}
 7   void Set(Write w) { writes_.push_back(w); }
 8   bool Get(Row row, Column c, string* value) {
 9     while (true) {
10       bigtable::Txn T = bigtable::StartRowTransaction(row);
11       // Check for locks that signal concurrent writes.
12       if (T.Read(row, c+"lock", [0, start_ts_])) {
13         // There is a pending lock; try to clean it and wait
14         BackoffAndMaybeCleanupLock(row, c);
15         continue;
16       }
17
18       // Find the latest write below our start_timestamp.
19       latest_write = T.Read(row, c+"write", [0, start_ts_]);
20       if (!latest_write.found()) return false; // no data
21       int data_ts = latest_write.start_timestamp();
22       *value = T.Read(row, c+"data", [data_ts, data_ts]);
23       return true;
24     }
25   }
26   // Prewrite tries to lock cell w, returning false in case of conflict.
27   bool Prewrite(Write w, Write primary) {
28     Column c = w.col;
29     bigtable::Txn T = bigtable::StartRowTransaction(w.row);
30
31     // Abort on writes after our start timestamp ...
32     if (T.Read(w.row, c+"write", [start_ts_, ∞])) return false;
33     // ... or locks at any timestamp.
34     if (T.Read(w.row, c+"lock", [0, ∞])) return false;
35
36     T.Write(w.row, c+"data", start_ts_, w.value);
37     T.Write(w.row, c+"lock", start_ts_,
38       {primary.row, primary.col}); // The primary’s location.
39     return T.Commit();
40   }
41   bool Commit() {
42     Write primary = writes_[0];
43     vector<Write> secondaries(writes_.begin()+1, writes_.end());
44     if (!Prewrite(primary, primary)) return false;
45     for (Write w : secondaries)
46       if (!Prewrite(w, primary)) return false;
47
48     int commit_ts = oracle_.GetTimestamp();
49
50     // Commit primary first.
51     Write p = primary;
52     bigtable::Txn T = bigtable::StartRowTransaction(p.row);
53     if (!T.Read(p.row, p.col+"lock", [start_ts_, start_ts_]))
54       return false; // aborted while working
55     T.Write(p.row, p.col+"write", commit_ts,
56       start_ts_); // Pointer to data written at start_ts_.
57     T.Erase(p.row, p.col+"lock", commit_ts);
58     if (!T.Commit()) return false; // commit point
59
60     // Second phase: write out write records for secondary cells.
61     for (Write w : secondaries) {
62       bigtable::Write(w.row, w.col+"write", commit_ts, start_ts_);
63       bigtable::Erase(w.row, w.col+"lock", commit_ts);
64     }
65     return true;
66   }
67 } // class Transaction

Figure 6: Pseudocode for Percolator transaction protocol.

If no cells conflict, the transaction may commit and
proceeds to the second phase. At the beginning of the
second phase, the client obtains the commit timestamp
from the timestamp oracle (line 48). Then, at each cell
(starting with the primary), the client releases its lock and
makes its write visible to readers by replacing the lock
with a write record. The write record indicates to readers
that committed data exists in this cell; it contains a
pointer to the start timestamp where readers can find the
actual data. Once the primary’s write is visible (line 58),
the transaction must commit since it has made a write
visible to readers.
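
To make the two phases concrete, the following is a single-threaded, in-memory sketch of the protocol, not the implementation in Figure 6. `Cell`, `Store`, and this simplified `Transaction` are hypothetical stand-ins: a `std::map` replaces Bigtable, a counter replaces the timestamp oracle, only one column per row is modeled, and waiting on pending locks in Get() is omitted. Unlike Figure 6, a failed prewrite here erases its own locks immediately instead of leaving them for lazy cleanup. The sketch assumes C++17.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for the data, lock, and write columns of one cell.
struct Cell {
  std::map<std::int64_t, std::string> data;    // start_ts  -> value
  std::map<std::int64_t, std::string> lock;    // start_ts  -> primary row
  std::map<std::int64_t, std::int64_t> write;  // commit_ts -> start_ts of data
};

// Toy table plus a counter playing the role of the timestamp oracle.
struct Store {
  std::map<std::string, Cell> rows;
  std::int64_t next_ts = 1;
  std::int64_t GetTimestamp() { return next_ts++; }
};

class Transaction {
 public:
  explicit Transaction(Store& store)
      : store_(store), start_ts_(store.GetTimestamp()) {}

  // Buffers the write until Commit().
  void Set(const std::string& row, const std::string& value) {
    writes_[row] = value;
  }

  // Snapshot read: follows the latest write record at or below start_ts_.
  std::optional<std::string> Get(const std::string& row) const {
    auto r = store_.rows.find(row);
    if (r == store_.rows.end()) return std::nullopt;
    const Cell& cell = r->second;
    auto w = cell.write.upper_bound(start_ts_);
    if (w == cell.write.begin()) return std::nullopt;  // no committed data
    --w;
    return cell.data.at(w->second);
  }

  bool Commit() {
    if (writes_.empty()) return true;
    const std::string primary = writes_.begin()->first;  // arbitrary choice
    // Phase 1 (prewrite): lock every cell, primary first.
    for (auto it = writes_.begin(); it != writes_.end(); ++it) {
      if (!Prewrite(it->first, it->second, primary)) {
        Rollback(it);  // undo the cells locked so far
        return false;
      }
    }
    // Phase 2: replace each lock with a write record, primary first.
    const std::int64_t commit_ts = store_.GetTimestamp();
    for (const auto& w : writes_) {
      Cell& cell = store_.rows[w.first];
      cell.write[commit_ts] = start_ts_;  // pointer to data at start_ts_
      cell.lock.erase(start_ts_);
    }
    return true;
  }

 private:
  using Writes = std::map<std::string, std::string>;  // row -> value

  bool Prewrite(const std::string& row, const std::string& value,
                const std::string& primary) {
    Cell& cell = store_.rows[row];
    // Conflict: a write record at or after our start timestamp ...
    if (cell.write.lower_bound(start_ts_) != cell.write.end()) return false;
    // ... or a lock at any timestamp.
    if (!cell.lock.empty()) return false;
    cell.data[start_ts_] = value;
    cell.lock[start_ts_] = primary;
    return true;
  }

  void Rollback(Writes::const_iterator end) {
    for (auto it = writes_.cbegin(); it != end; ++it) {
      Cell& cell = store_.rows[it->first];
      cell.data.erase(start_ts_);
      cell.lock.erase(start_ts_);
    }
  }

  Store& store_;
  std::int64_t start_ts_;
  Writes writes_;
};
```

Replaying the transfer of Figure 4 against this model, a second transaction that started before the transfer committed and also writes Bob's balance fails its prewrite, because it finds a write record newer than its start timestamp.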

A Get() operation first checks for a lock in the times- 
tamp range [0, start_timestamp], which is the range of 
timestamps visible in the transaction’s snapshot (line 12). 
If a lock is present, another transaction is concurrently 
writing this cell, so the reading transaction must wait un- 
til the lock is released. If no conflicting lock is found, 
Get() reads the latest write record in that timestamp range 
(line 19) and returns the data item corresponding to that 
write record (line 22). 

Transaction processing is complicated by the possibil- 
ity of client failure (tablet server failure does not affect 
the system since Bigtable guarantees that written locks 
persist across tablet server failures). If a client fails while 
a transaction is being committed, locks will be left be- 
hind. Percolator must clean up those locks or they will 
cause future transactions to hang indefinitely. Percolator 
takes a lazy approach to cleanup: when a transaction A 
encounters a conflicting lock left behind by transaction 
B, A may determine that B has failed and erase its locks. 

It is very difficult for A to be perfectly confident in
its judgment that B has failed; as a result we must avoid
a race between A cleaning up B’s transaction and a
not-actually-failed B committing the same transaction.
Percolator handles this by designating one cell in every
transaction as a synchronizing point for any commit or
cleanup operations. This cell’s lock is called the primary
lock. Both A and B agree on which lock is primary (the
location of the primary is written into the locks at all
other cells). Performing either a cleanup or commit
operation requires modifying the primary lock; since this
modification is performed under a Bigtable row transaction,
only one of the cleanup or commit operations will
succeed. Specifically: before B commits, it must check
that it still holds the primary lock and replace it with a
write record. Before A erases B’s lock, A must check
the primary to ensure that B has not committed; if the
primary lock is still present, then A can safely erase the
lock.
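
The race between commit and cleanup can be modeled in a few lines. This is a sketch under stated assumptions: `PrimaryCell`, `TryCommit`, and `TryCleanup` are hypothetical names, and a `std::mutex` stands in for the atomicity that a Bigtable row transaction provides on the primary's row.

```cpp
#include <mutex>

// Hypothetical model of a transaction's primary cell.
class PrimaryCell {
 public:
  // Called by the lock's owner (B): succeeds only if B still holds the
  // primary lock, which is then replaced by a write record.
  bool TryCommit() {
    std::lock_guard<std::mutex> guard(mu_);
    if (!lock_held_) return false;  // the lock was cleaned up: abort
    lock_held_ = false;
    committed_ = true;
    return true;
  }

  // Called by a conflicting transaction (A): succeeds only if the primary
  // lock is still present, which means B has not committed.
  bool TryCleanup() {
    std::lock_guard<std::mutex> guard(mu_);
    if (!lock_held_) return false;  // B committed or was already cleaned up
    lock_held_ = false;
    return true;
  }

  bool committed() const {
    std::lock_guard<std::mutex> guard(mu_);
    return committed_;
  }

 private:
  mutable std::mutex mu_;  // stands in for the row transaction
  bool lock_held_ = true;
  bool committed_ = false;
};
```

Whichever of `TryCommit` and `TryCleanup` runs first wins; the other observes that the primary lock is gone and gives up.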

When a client crashes during the second phase of
commit, a transaction will be past the commit point
(it has written at least one write record) but will still
have locks outstanding. We must perform roll-forward on
these transactions. A transaction that encounters a lock
can distinguish between the two cases (a crash before the
commit point and a crash after it) by inspecting the
primary lock: if the primary lock has been replaced by a
write record, the transaction that wrote the lock must
have committed and the lock must be rolled forward;
otherwise it should be rolled back (since we always commit
the primary first, we can be sure that it is safe to roll back
if the primary is not committed). To roll forward, the
transaction performing the cleanup replaces the stranded
lock with a write record as the original transaction would
have done.
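
The roll-forward decision reduces to a lookup in the primary cell's write column. The following is a minimal sketch: `Resolution`, `WriteColumn`, and `Resolve` are hypothetical names, and the caller is assumed to have already decided that the stranded lock's owner is dead and that the primary lock itself has been dealt with.

```cpp
#include <cstdint>
#include <map>

enum class Resolution { kRollForward, kRollBack };

// Hypothetical view of the primary cell's write column:
// commit_ts -> start_ts of the committed data.
using WriteColumn = std::map<std::int64_t, std::int64_t>;

// Decides the fate of a stranded secondary lock left by the transaction
// that started at `start_ts`.
Resolution Resolve(const WriteColumn& primary_writes, std::int64_t start_ts) {
  for (const auto& entry : primary_writes) {
    // A write record pointing at start_ts means the primary committed.
    if (entry.second == start_ts) return Resolution::kRollForward;
  }
  return Resolution::kRollBack;
}
```

Using the values of Figure 4, a primary write column of {6 -> 5, 8 -> 7} means the transaction that started at timestamp 7 committed, so Joe's stranded lock is rolled forward; with only {6 -> 5} it is rolled back.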


Since cleanup is synchronized on the primary lock, it 
is safe to clean up locks held by live clients; however, 
this incurs a performance penalty since rollback forces 
the transaction to abort. So, a transaction will not clean 
up a lock unless it suspects that a lock belongs to a dead 
or stuck worker. Percolator uses simple mechanisms to 
determine the liveness of another transaction. Running 
workers write a token into the Chubby lockservice [8] 
to indicate they belong to the system; other workers can 
use the existence of this token as a sign that the worker is 
alive (the token is automatically deleted when the process 
exits). To handle a worker that is live, but not working, 
we additionally write the wall time into the lock; a lock 
that contains a too-old wall time will be cleaned up even 
if the worker’s liveness token is valid. To handle long- 
running commit operations, workers periodically update 
this wall time while committing. 
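
The two liveness signals combine into a simple predicate. This sketch is illustrative only: `LockInfo`, `ShouldCleanUp`, and the `max_age` threshold are assumptions introduced here; the text does not say how old a wall time must be before a lock is considered stale.

```cpp
#include <chrono>

using Clock = std::chrono::system_clock;

// Hypothetical summary of what a transaction learns about a lock it hit.
struct LockInfo {
  bool owner_token_present;     // does the owner's liveness token exist?
  Clock::time_point wall_time;  // last wall time the owner wrote to the lock
};

// Returns true if the lock appears to belong to a dead or stuck worker.
bool ShouldCleanUp(const LockInfo& lock, Clock::time_point now,
                   std::chrono::seconds max_age) {
  if (!lock.owner_token_present) return true;  // the worker process exited
  return now - lock.wall_time > max_age;       // live but not making progress
}
```

A worker in a long-running commit keeps its lock out of the second case by refreshing `wall_time` periodically.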


2.3 Timestamps


The timestamp oracle is a server that hands out times- 
tamps in strictly increasing order. Since every transaction 
requires contacting the timestamp oracle twice, this ser- 
vice must scale well. The oracle periodically allocates 
a range of timestamps by writing the highest allocated 
timestamp to stable storage; given an allocated range of 
timestamps, the oracle can satisfy future requests strictly 
from memory. If the oracle restarts, the timestamps will 
jump forward to the maximum allocated timestamp (but 
will never go backwards). To save RPC overhead (at the 
cost of increasing transaction latency) each Percolator 
worker batches timestamp requests across transactions 
by maintaining only one pending RPC to the oracle. As 
the oracle becomes more loaded, the batching naturally 
increases to compensate. Batching increases the scalabil- 
ity of the oracle but does not affect the timestamp guar- 
antees. Our oracle serves around 2 million timestamps 
per second from a single machine. 
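
The range-allocation scheme can be sketched as follows. The names `StableStorage` and `TimestampOracle` and the `batch` parameter are hypothetical; an in-memory struct stands in for stable storage, and the client-side batching of RPCs is not modeled.

```cpp
#include <cstdint>

// Hypothetical stand-in for stable storage: it survives oracle restarts.
struct StableStorage {
  std::int64_t highest_allocated = 0;
};

class TimestampOracle {
 public:
  // On (re)start, resume just above the highest timestamp ever allocated.
  TimestampOracle(StableStorage& storage, std::int64_t batch)
      : storage_(storage),
        batch_(batch),
        next_(storage.highest_allocated + 1),
        limit_(storage.highest_allocated) {}

  std::int64_t GetTimestamp() {
    if (next_ > limit_) {
      // Range exhausted: allocate a new one by persisting its upper bound
      // before handing out any timestamp from it.
      limit_ = next_ + batch_ - 1;
      storage_.highest_allocated = limit_;
    }
    return next_++;  // served from memory
  }

 private:
  StableStorage& storage_;
  std::int64_t batch_;
  std::int64_t next_;   // next timestamp to hand out
  std::int64_t limit_;  // last timestamp covered by the persisted range
};
```

If the oracle restarts after handing out timestamps 1 through 5 from a persisted range of 100, the new instance starts at 101: timestamps 6 through 100 are skipped, but none is ever reissued.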

The transaction protocol uses strictly increasing
timestamps to guarantee that Get() returns all committed
writes before the transaction’s start timestamp. To see
how it provides this guarantee, consider a transaction R
reading at timestamp T_R and a transaction W that
committed at timestamp T_W < T_R; we will show that R sees
W’s writes. Since T_W < T_R, we know that the timestamp
oracle gave out T_W before or in the same batch
as T_R; hence, W requested T_W before R received T_R.
We know that R can’t do reads before receiving its start
timestamp T_R and that W wrote locks before requesting
its commit timestamp T_W. Therefore, the above property
guarantees that W must have at least written all its locks
before R did any reads; R’s Get() will see either the
fully-committed write record or the lock, in which case R will
block until the lock is released. Either way, W’s write is
visible to R’s Get().


2.4 Notifications 


Transactions let the user mutate the table while main- 
taining invariants, but users also need a way to trigger 
and run the transactions. In Percolator, the user writes 
code (“observers”) to be triggered by changes to the table,
and we link all the observers into a binary running
alongside every tablet server in the system. Each ob- 
server registers a function and a set of columns with Per- 
colator, and Percolator invokes the function after data is 
written to one of those columns in any row. 

Percolator applications are structured as a series of ob- 
servers; each observer completes a task and creates more 
work for “downstream” observers by writing to the table. 
In our indexing system, a MapReduce loads crawled doc- 
uments into Percolator by running loader transactions, 
which trigger the document processor transaction to in- 
dex the document (parse, extract links, etc.). The docu- 
ment processor transaction triggers further transactions 
like clustering. The clustering transaction, in turn, trig- 
gers transactions to export changed document clusters to 
the serving system. 
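
The chaining of observers can be illustrated with a toy, single-process worker. Everything here is a hypothetical sketch: `Worker`, `Register`, `Write`, `Read`, and `RunUntilIdle` are not Percolator's API, each observer call stands in for what would be a separate transaction, and a queue replaces the notification scan.

```cpp
#include <deque>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical toy worker: a table plus observers keyed by column.
class Worker {
 public:
  using Observer = std::function<void(Worker&, const std::string& row)>;

  // Registers `fn` to run after any write to `column`.
  void Register(const std::string& column, Observer fn) {
    observers_[column] = std::move(fn);
  }

  // Writes a cell and, if the column is observed, queues a notification.
  void Write(const std::string& row, const std::string& column,
             const std::string& value) {
    table_[std::make_pair(row, column)] = value;
    if (observers_.count(column) != 0) pending_.emplace_back(row, column);
  }

  const std::string& Read(const std::string& row,
                          const std::string& column) const {
    return table_.at(std::make_pair(row, column));
  }

  // Runs observers until no notifications remain.
  void RunUntilIdle() {
    while (!pending_.empty()) {
      const std::pair<std::string, std::string> note = pending_.front();
      pending_.pop_front();
      observers_.at(note.second)(*this, note.first);
    }
  }

 private:
  std::map<std::pair<std::string, std::string>, std::string> table_;
  std::map<std::string, Observer> observers_;
  std::deque<std::pair<std::string, std::string>> pending_;  // (row, column)
};
```

Registering a document-processing observer on one column that writes a second column, and a clustering observer on that second column, reproduces the pattern described above: an external write to the first column is enough to drive the whole chain.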


Notifications are similar to database triggers or events 
in active databases [29], but unlike database triggers, 
they cannot be used to maintain database invariants. In 
particular, the triggered observer runs in a separate trans- 
action from the triggering write, so the triggering write 
and the triggered observer’s writes are not atomic. No- 
tifications are intended to help structure an incremental 
computation rather than to help maintain data integrity. 

This difference in semantics and intent makes observer 
behavior much easier to understand than the complex se- 
mantics of overlapping triggers. Percolator applications 
consist of very few observers — the Google indexing 
system has roughly 10 observers. Each observer is ex- 
plicitly constructed in the main() of the worker binary, 
so it is clear what observers are active. It is possible for 
several observers to observe the same column, but we 
avoid this feature so it is clear what observer will run 
when a particular column is written. Users do need to be 
wary about infinite cycles of notifications, but Percolator 
does nothing to prevent this; the user typically constructs 
a series of observers to avoid infinite cycles.

We do provide one guarantee: at most one observer’s 
transaction will commit for each change of an observed 
column. The converse is not true, however: multiple 
writes to an observed column may cause the correspond- 
ing observer to be invoked only once. We call this feature 
message collapsing, since it helps avoid computation by 
amortizing the cost of responding to many notifications. 
For example, it is sufficient for http://google.com
to be reprocessed periodically rather than every time we 
discover a new link pointing to it. 

To provide these semantics for notifications, each ob- 
served column has an accompanying “acknowledgment” 
column for each observer, containing the latest start 
timestamp at which the observer ran. When the observed 
column is written, Percolator starts a transaction to pro- 
cess the notification. The transaction reads the observed 
column and its corresponding acknowledgment column. 
If the observed column was written after its last acknowl- 
edgment, then we run the observer and set the acknowl- 
edgment column to our start timestamp. Otherwise, the 
observer has already been run, so we do not run it again. 
Note that if Percolator accidentally starts two transac- 
tions concurrently for a particular notification, they will 
both see the dirty notification and run the observer, but 
one will abort because they will conflict on the acknowl- 
edgment column. We promise that at most one observer 
will commit for each notification. 
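
The acknowledgment check can be sketched as a comparison of two timestamps. `ObservedCell` and `MaybeRunObserver` are hypothetical names; the sketch assumes timestamps start at 1 (so 0 means "never"), that all recorded writes committed before `start_ts`, and it does not model the transactional conflict on the acknowledgment column that resolves concurrent attempts.

```cpp
#include <cstdint>

// Hypothetical view of one observed cell and its acknowledgment column.
struct ObservedCell {
  std::int64_t last_write_ts = 0;  // commit timestamp of the latest write
  std::int64_t ack_ts = 0;         // start timestamp of the last observer run
};

// Runs inside a notification transaction that started at `start_ts`.
// Returns true if the observer should run, and records the acknowledgment.
bool MaybeRunObserver(ObservedCell& cell, std::int64_t start_ts) {
  const bool dirty = cell.last_write_ts > cell.ack_ts;
  if (!dirty) return false;  // already processed: do not run again
  cell.ack_ts = start_ts;    // acknowledge at our start timestamp
  return true;
}
```

Several writes followed by one call yield a single observer run, which is the message collapsing described above; a second call without an intervening write returns false.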

To implement notifications, Percolator needs to effi- 
ciently find dirty cells with observers that need to be run. 
This search is complicated by the fact that notifications 
are rare: our table has trillions of cells, but, if the system 
is keeping up with applied load, there will only be mil- 
lions of notifications. Additionally, observer code is run 
on a large number of client processes distributed across a 
collection of machines, meaning that this search for dirty 
cells must be distributed. 

To identify dirty cells, Percolator maintains a special 
“notify” Bigtable column, containing an entry for each 
dirty cell. When a transaction writes an observed cell, 
it also sets the corresponding notify cell. The workers 
perform a distributed scan over the notify column to find 
dirty cells. After the observer is triggered and the transac- 
tion commits, we remove the notify cell. Since the notify 
column is just a Bigtable column, not a Percolator col- 
umn, it has no transactional properties and serves only as 
a hint to the scanner to check the acknowledgment col- 
umn to determine if the observer should be run. 

To make this scan efficient, Percolator stores the notify 
column in a separate Bigtable locality group so that scan- 
ning over the column requires reading only the millions 
of dirty cells rather than the trillions of total data cells. 
Each Percolator worker dedicates several threads to the 
scan. For each thread, the worker chooses a portion of the 


table to scan by first picking a random Bigtable tablet, 
then picking a random key in the tablet, and finally scan- 
ning the table from that position. Since each worker is 
scanning a random region of the table, we worry about 
two workers running observers on the same row con- 
currently. While this behavior will not cause correctness 
problems due to the transactional nature of notifications, 
it is inefficient. To avoid this, each worker acquires a lock 
from a lightweight lock service before scanning the row. 
This lock server need not persist state since it is advisory 
and thus is very scalable. 


The random-scanning approach requires one addi- 
tional tweak: when it was first deployed we noticed that 
scanning threads would tend to clump together in a few 
regions of the table, effectively reducing the parallelism 
of the scan. This phenomenon is commonly seen in pub- 
lic transportation systems where it is known as “platoon- 
ing” or “bus clumping” and occurs when a bus is slowed 
down (perhaps by traffic or slow loading). Since the num- 
ber of passengers at each stop grows with time, loading 
delays become even worse, further slowing the bus. Si- 
multaneously, any bus behind the slow bus speeds up 
as it needs to load fewer passengers at each stop. The 
result is a clump of buses arriving simultaneously at a 
stop [19]. Our scanning threads behaved analogously: a 
thread that was running observers slowed down while 
threads “behind” it quickly skipped past the now-clean 
rows to clump with the lead thread and failed to pass 
the lead thread because the clump of threads overloaded 
tablet servers. To solve this problem, we modified our 
system in a way that public transportation systems can- 
not: when a scanning thread discovers that it is scanning 
the same row as another thread, it chooses a new random 
location in the table to scan. To further the transporta- 
tion analogy, the buses (scanner threads) in our city avoid 
clumping by teleporting themselves to a random stop (lo- 
cation in the table) if they get too close to the bus in front 
of them. 
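
The "teleporting" rule can be written as a small step function. This is a sketch under assumptions: rows are modeled as indices in a table of fixed size, `NextScanPosition` and its parameters are hypothetical, and a set of rows held by other threads replaces the lightweight lock service.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <set>

// Returns the next row a scanning thread should visit. Normally this is the
// following row; if another thread already holds that row, the scanner jumps
// to a uniformly random row instead of queuing up behind it.
std::size_t NextScanPosition(std::size_t current,
                             const std::set<std::size_t>& rows_held_by_others,
                             std::size_t table_size, std::mt19937& rng) {
  assert(table_size > 0);
  const std::size_t next = (current + 1) % table_size;
  if (rows_held_by_others.count(next) == 0) return next;
  std::uniform_int_distribution<std::size_t> pick(0, table_size - 1);
  return pick(rng);  // may need another jump if this row is also held
}
```

The sketch does not retry when the random row is itself held; a caller would simply apply the same rule again on its next step.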


Finally, experience with notifications led us to intro- 
duce a lighter-weight but semantically weaker notifica- 
tion mechanism. We found that when many duplicates of 
the same page were processed concurrently, each trans- 
action would conflict trying to trigger reprocessing of the 
same duplicate cluster. This led us to devise a way to no- 
tify a cell without the possibility of transactional conflict. 
We implement this weak notification by writing only to 
the Bigtable “notify” column. To preserve the transac- 
tional semantics of the rest of Percolator, we restrict these 
weak notifications to a special type of column that can- 
not be written, only notified. The weaker semantics also 
mean that multiple observers may run and commit as a 
result of a single weak notification (though the system 
tries to minimize this occurrence). This has become an 
important feature for managing conflicts; if an observer 
frequently conflicts on a hotspot, it often helps to break 
it into two observers connected by a non-transactional 
notification on the hotspot. 


2.5 Discussion 


One of the inefficiencies of Percolator relative to a 
MapReduce-based system is the number of RPCs sent 
per work-unit. While MapReduce does a single large 
read to GFS and obtains all of the data for 10s or 100s 
of web pages, Percolator performs around 50 individual 
Bigtable operations to process a single document. 


One source of additional RPCs occurs during commit. 
When writing a lock, we must do a read-modify-write 
operation requiring two Bigtable RPCs: one to read for 
conflicting locks or writes and another to write the new 
lock. To reduce this overhead, we modified the Bigtable 
API by adding conditional mutations, which implement 
the read-modify-write step in a single RPC. Many con- 
ditional mutations destined for the same tablet server 
can also be batched together into a single RPC to fur- 
ther reduce the total number of RPCs we send. We create 
batches by delaying lock operations for several seconds 
to collect them into batches. Because locks are acquired 
in parallel, this adds only a few seconds to the latency 
of each transaction; we compensate for the additional la- 
tency with greater parallelism. Batching also increases 
the time window in which conflicts may occur, but in our 
low-contention environment this has not proved to be a 
problem. 
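
The effect of conditional mutations and batching on the RPC count can be sketched as follows. This is a toy model under stated assumptions, not the Bigtable API: `FakeTablet`, `apply_batch`, `batch_by_server`, and the key-to-server mapping are all hypothetical.

```python
from collections import defaultdict


class FakeTablet:
    """Toy single-server table; stands in for a Bigtable tablet server."""

    def __init__(self):
        self.cells = {}
        self.rpcs = 0

    def apply_batch(self, mutations):
        """One RPC carrying many conditional mutations.

        Each mutation is (key, value) and succeeds only if the key is absent,
        so the read (check) and the write happen in a single server-side step.
        """
        self.rpcs += 1
        results = []
        for key, value in mutations:
            if key in self.cells:
                results.append(False)  # conflicting lock or write present
            else:
                self.cells[key] = value
                results.append(True)
        return results


def batch_by_server(pending, server_of):
    """Group delayed lock operations by the server that owns each key."""
    batches = defaultdict(list)
    for key, value in pending:
        batches[server_of(key)].append((key, value))
    return batches


def server_of(key):
    # Hypothetical partitioning: keys before "m" live on s0, the rest on s1.
    return "s0" if key < "m" else "s1"


servers = {"s0": FakeTablet(), "s1": FakeTablet()}
pending = [("a:lock", "t1"), ("b:lock", "t1"), ("z:lock", "t1"), ("a:lock", "t2")]

outcome = {}
for name, muts in batch_by_server(pending, server_of).items():
    outcome[name] = servers[name].apply_batch(muts)
total_rpcs = sum(s.rpcs for s in servers.values())
```

In this toy run, four lock writes cost two RPCs, where separate read and write calls would have cost eight; the second attempt on `a:lock` fails because the first already holds the cell.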

We also perform the same batching when reading from 
the table: every read operation is delayed to give it a 
chance to form a batch with other reads to the same 
tablet server. This delays each read, potentially greatly 
increasing transaction latency. A final optimization miti- 
gates this effect, however: prefetching. Prefetching takes 
advantage of the fact that reading two or more values 
in the same row is essentially the same cost as reading 
one value. In either case, Bigtable must read the entire 
SSTable block from the file system and decompress it. 
Percolator attempts to predict, each time a column is 
read, what other columns in a row will be read later in 
the transaction. This prediction is made based on past be- 
havior. Prefetching, combined with a cache of items that 
have already been read, reduces the number of Bigtable 
reads the system would otherwise do by a factor of 10. 
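
A minimal sketch of history-based prefetching with a read cache is given below. The class name `PrefetchingReader`, its methods, and the dictionary-backed table are hypothetical; the sketch also assumes each transaction reads from a single row, which the real system does not require.

```python
from collections import defaultdict


class PrefetchingReader:
    """Reads columns from a row store, prefetching columns seen together before."""

    def __init__(self, table):
        self.table = table                    # {row: {column: value}}
        self.history = defaultdict(set)       # column -> columns read later
        self.cache = {}                       # (row, column) -> value
        self.backend_reads = 0
        self._read_in_txn = []

    def begin(self):
        """Start a new transaction with an empty cache."""
        self.cache.clear()
        self._read_in_txn = []

    def get(self, row, column):
        # Learn: every column read earlier in this transaction predicts this one.
        for earlier in self._read_in_txn:
            self.history[earlier].add(column)
        self._read_in_txn.append(column)

        if (row, column) not in self.cache:
            # One backend read fetches the column plus its predicted companions,
            # since reading several values in a row costs about the same as one.
            self.backend_reads += 1
            for col in {column} | self.history[column]:
                if col in self.table[row]:
                    self.cache[(row, col)] = self.table[row][col]
        return self.cache.get((row, column))


table = {"r1": {"a": 1, "b": 2, "c": 3}, "r2": {"a": 4, "b": 5, "c": 6}}
reader = PrefetchingReader(table)

reader.begin()
first = [reader.get("r1", c) for c in ("a", "b", "c")]   # cold: one read per column
reads_cold = reader.backend_reads

reader.begin()
second = [reader.get("r2", c) for c in ("a", "b", "c")]  # warm: "a" prefetches b and c
reads_warm = reader.backend_reads - reads_cold
```

Once the access pattern has been observed, the second transaction needs a single backend read instead of three.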

Early in the implementation of Percolator, we decided 
to make all API calls blocking and rely on running thou- 
sands of threads per machine to provide enough par- 
allelism to maintain good CPU utilization. We chose 
this thread-per-request model mainly to make application 
code easier to write, compared to the event-driven model. 
Forcing users to bundle up their state each of the (many) 
times they fetched a data item from the table would have 
made application development much more difficult. Our 
experience with thread-per-request was, on the whole, 
positive: application code is simple, we achieve good uti- 
lization on many-core machines, and crash debugging is 
simplified by meaningful and complete stack traces. We 
encountered fewer race conditions in application code 
than we feared. The biggest drawbacks of the approach 
were scalability issues in the Linux kernel and Google 
infrastructure related to high thread counts. Our in-house 
kernel development team was able to deploy fixes to ad- 
dress the kernel issues. 


3 Evaluation 


Percolator lies somewhere in the performance space 
between MapReduce and DBMSs. For example, because 
Percolator is a distributed system, it uses far more re- 
sources to process a fixed amount of data than a tradi- 
tional DBMS would; this is the cost of its scalability. 
Compared to MapReduce, Percolator can process data 
with far lower latency, but again, at the cost of additional 
resources required to support random lookups. These are 
engineering tradeoffs which are difficult to quantify: how 
much of an efficiency loss is too much to pay for the abil- 
ity to add capacity endlessly simply by purchasing more 
machines? Or: how does one trade off the reduction in 
development time provided by a layered system against 
the corresponding decrease in efficiency? 

In this section we attempt to answer some of these 
questions by first comparing Percolator to batch pro- 
cessing systems via our experiences with converting 
a MapReduce-based indexing pipeline to use Percola- 
tor. We'll also evaluate Percolator with microbench- 
marks and a synthetic workload based on the well-known 
TPC-E benchmark [1]; this test will give us a chance to 
evaluate the scalability and efficiency of Percolator rela- 
tive to Bigtable and DBMSs. 

All of the experiments in this section are run on a sub- 
set of the servers in a Google data center. The servers run 
the Linux operating system on x86 processors; each ma- 
chine is connected to several commodity SATA drives. 


3.1 Converting from MapReduce 


We built Percolator to create Google’s large “base” 
index, a task previously performed by MapReduce. In 
our previous system, each day we crawled several billion 
documents and fed them along with a repository of ex- 
isting documents through a series of 100 MapReduces. 
The result was an index which answered user queries. 
Though not all 100 MapReduces were on the critical path 
for every document, the organization of the system as a 
series of MapReduces meant that each document spent 
2-3 days being indexed before it could be returned as a 
search result. 

The Percolator-based indexing system (known as Caffeine [25]) 
crawls the same number of documents, 
but we feed each document through Percolator as it 
is crawled. The immediate advantage, and main design 
goal, of Caffeine is a reduction in latency: the median 
document moves through Caffeine over 100x faster than 
the previous system. This latency improvement grows as 
the system becomes more complex: adding a new clus- 
tering phase to the Percolator-based system requires an 
extra lookup for each document rather than an extra scan over 
the repository. Additional clustering phases can also be 
implemented in the same transaction rather than in an- 
other MapReduce; this simplification is one reason the 
number of observers in Caffeine (10) is far smaller than 
the number of MapReduces in the previous system (100). 
This organization also allows for the possibility of per- 
forming additional processing on only a subset of the 
repository without rescanning the entire repository. 

Adding additional clustering phases isn’t free in an in- 
cremental system: more resources are required to make 
sure the system keeps up with the input, but this is still 
an improvement over batch processing systems where no 
amount of resources can overcome delays introduced by 
stragglers in an additional pass over the repository. Caf- 
feine is essentially immune to stragglers that were a seri- 
ous problem in our batch-based indexing system because 
the bulk of the processing does not get held up by a few 
very slow operations. The radically-lower latency of the 
new system also enables us to remove the rigid distinc- 
tions between large, slow-to-update indexes and smaller, 
more rapidly updated indexes. Because Percolator frees 
us from needing to process the repository each time we 
index documents, we can also make it larger: Caffeine’s 
document collection is currently 3x larger than the previous 
system’s and is limited only by available disk space. 

Compared to the system it replaced, Caffeine uses 
roughly twice as many resources to process the same 
crawl rate. However, Caffeine makes good use of the ex- 
tra resources. If we were to run the old indexing system 
with twice as many resources, we could either increase 
the index size or reduce latency by at most a factor of two 
(but not do both). On the other hand, if Caffeine were run 
with half the resources, it would not be able to process as 
many documents per day as the old system (but the documents 
it did produce would have much lower latency). 

The new system is also easier to operate. Caffeine has 
far fewer moving parts: we run tablet servers, Percola- 
tor workers, and chunkservers. In the old system, each of 
a hundred different MapReduces needed to be individ- 
ually configured and could independently fail. Also, the 
“peaky” nature of the MapReduce workload made it hard 
to fully utilize the resources of a datacenter compared to 
Percolator’s much smoother resource usage. 

The simplicity of writing straight-line code and the 
ability to do random lookups into the repository make 
developing new features for Percolator easy. Under 
MapReduce, random lookups are awkward and costly. 
On the other hand, Caffeine developers need to reason 
about concurrency where it did not exist in the MapReduce 
paradigm. Transactions help deal with this concurrency, 
but can’t fully eliminate the added complexity. 

Figure 7: Median document clustering delay for Percolator 
(dashed line) and MapReduce (solid line). For MapReduce, all 
documents finish processing at the same time and error bars 
represent the min, median, and max of three runs of the clustering 
MapReduce. For Percolator, we are able to measure the 
delay of individual documents, so the error bars represent the 
5th- and 95th-percentile delay on a per-document level. 

To quantify the benefits of moving from MapRe- 
duce to Percolator, we created a synthetic benchmark 
that clusters newly crawled documents against a billion- 
document repository to remove duplicates in much the 
same way Google’s indexing pipeline operates. Docu- 
ments are clustered by three clustering keys. In a real sys- 
tem, the clustering keys would be properties of the docu- 
ment like redirect target or content hash, but in this exper- 
iment we selected them uniformly at random from a col- 
lection of 750M possible keys. The average cluster in our 
synthetic repository contains 3.3 documents, and 93% of 
the documents are in a non-singleton cluster. This dis- 
tribution of keys exercises the clustering logic, but does 
not expose it to the few extremely large clusters we have 
seen in practice. These clusters only affect the latency 
tail and not the results we present here. In the Percola- 
tor clustering implementation, each crawled document is 
immediately written to the repository to be clustered by 
an observer. The observer maintains an index table for 
each clustering key and compares the document against 
each index to determine if it is a duplicate (an elabora- 
tion of Figure 2). MapReduce implements clustering of 
continually arriving documents by repeatedly running a 
sequence of three clustering MapReduces (one for each 
clustering key). The sequence of three MapReduces pro- 
cesses the entire repository and any crawled documents 
that accumulated while the previous three were running. 

This experiment simulates clustering documents 
crawled at a uniform rate. Whether MapReduce or Percolator 
performs better under this metric is a function of 
how frequently documents are crawled (the crawl rate) 
and the repository size. We explore this space by fixing 
the size of the repository and varying the rate at which 
new documents arrive, expressed as a percentage of the 
repository crawled per hour. In a practical system, a very 
small percentage of the repository would be crawled per 
hour: there are over 1 trillion web pages on the web (and 
ideally in an indexing system’s repository), far too many 
to crawl a reasonable fraction of in a single day. When 
the new input is a small fraction of the repository (low 
crawl rate), we expect Percolator to outperform MapRe- 
duce since MapReduce must map over the (large) repos- 
itory to cluster the (small) batch of new documents while 
Percolator does work proportional only to the small batch 
of newly arrived documents (a lookup in up to three in- 
dex tables per document). At very large crawl rates where 
the number of newly crawled documents approaches the 
size of the repository, MapReduce will perform better 
than Percolator. This cross-over occurs because stream- 
ing data from disk is much cheaper, per byte, than per- 
forming random lookups. At the cross-over the total cost 
of the lookups required to cluster the new documents un- 
der Percolator equals the cost to stream the documents 
and the repository through MapReduce. At crawl rates 
higher than that, one is better off using MapReduce. 
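
The cross-over argument can be made concrete with a simple cost model. The model is an assumption introduced here for illustration: it charges Percolator a fixed cost per random lookup and MapReduce a fixed cost per streamed document, and the unit costs below are placeholders, not measurements. The function names are hypothetical.

```python
def percolator_cost(rate, repo_docs, lookups_per_doc, lookup_cost):
    """Work proportional only to newly arrived documents."""
    return rate * repo_docs * lookups_per_doc * lookup_cost


def mapreduce_cost(rate, repo_docs, stream_cost):
    """Work proportional to the repository plus the new documents."""
    return (1.0 + rate) * repo_docs * stream_cost


def crossover_rate(lookups_per_doc, lookup_cost, stream_cost):
    """Crawl rate (fraction of repository per batch) where the costs are equal.

    Solves rate * k * c_lookup == (1 + rate) * c_stream for rate.
    """
    denom = lookups_per_doc * lookup_cost - stream_cost
    if denom <= 0:
        return float("inf")  # lookups never cost more: no cross-over
    return stream_cost / denom


# Illustrative unit costs only (not measurements): a random lookup is assumed
# to cost twice as much as streaming one document.
r_star = crossover_rate(lookups_per_doc=3, lookup_cost=2.0, stream_cost=1.0)
low = percolator_cost(0.01, 1e9, 3, 2.0) < mapreduce_cost(0.01, 1e9, 1.0)
high = percolator_cost(1.0, 1e9, 3, 2.0) > mapreduce_cost(1.0, 1e9, 1.0)
```

Under these placeholder costs Percolator is cheaper at a 1% crawl rate and MapReduce is cheaper at 100%, which matches the qualitative shape of the argument; the actual cross-over point depends on the real cost ratio.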


We ran this benchmark on 240 machines and measured 
the median delay between when a document is crawled 
and when it is clustered. Figure 7 plots the median la- 
tency of document processing for both implementations 
as a function of crawl rate. When the crawl rate is low, 
Percolator clusters documents faster than MapReduce as 
expected; this scenario is illustrated by the leftmost pair 
of points which correspond to crawling 1 percent of documents 
per hour. MapReduce requires approximately 20 
minutes to cluster the documents because it takes 20 
minutes just to process the repository through the three 
MapReduces (the effect of the few newly crawled doc- 
uments on the runtime is negligible). This results in an 
average delay between crawling a document and cluster- 
ing of around 30 minutes: a random document waits 10 
minutes after being crawled for the previous sequence of 
MapReduces to finish and then spends 20 minutes be- 
ing processed by the three MapReduces. Percolator, on 
the other hand, finds a newly loaded document and pro- 
cesses it in two seconds on average, or about 1000x faster 
than MapReduce. The two seconds includes the time to 
find the dirty notification and run the transaction that per- 
forms the clustering. Note that this 1000x latency im- 
provement could be made arbitrarily large by increasing 
the size of the repository. 
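
The 30-minute delay and the roughly 1000x ratio follow from simple arithmetic, reproduced here as a check. The sketch assumes, as the experiment does, that documents arrive at a uniform rate and that each sequence of MapReduces starts as soon as the previous one finishes; the function name is hypothetical.

```python
def mapreduce_avg_delay_s(pass_seconds):
    """Average crawl-to-clustered delay when batches run back to back.

    A document arrives at a uniformly random point during the current pass,
    so it waits half a pass on average, then spends one full pass being
    processed.
    """
    return pass_seconds / 2 + pass_seconds


pass_s = 20 * 60                      # one sequence of three MapReduces
mr_delay = mapreduce_avg_delay_s(pass_s)
percolator_delay = 2.0                # seconds, as reported above
speedup = mr_delay / percolator_delay
```

The computed ratio is 900, consistent with the "about 1000x" figure. Because the MapReduce pass time grows with the repository while the Percolator delay does not, the same formula shows why the ratio grows with repository size.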


As the crawl rate increases, MapReduce’s processing 
time grows correspondingly. Ideally, it would be propor- 
tional to the combined size of the repository and the input 
which grows with the crawl rate. In practice, the running 
time of a small MapReduce like this is limited by stragglers, 
so the growth in processing time (and thus clustering 
latency) is only weakly correlated to crawl rate at low 
crawl rates. The 6 percent crawl rate, for example, only 
adds 150GB to a 1TB data set; the extra time to process 
150GB is in the noise. The latency of Percolator is relatively 
unchanged as the crawl rate grows until it suddenly 
increases to effectively infinity at a crawl rate of 40% 
per hour. At this point, Percolator saturates the resources 
of the test cluster, is no longer able to keep up with the 
crawl rate, and begins building an unbounded queue of 
unprocessed documents. The dotted asymptote at 40% 
is an extrapolation of Percolator’s performance beyond 
this breaking point. MapReduce is subject to the same 
effect: eventually crawled documents accumulate faster 
than MapReduce is able to cluster them, and the batch 
size will grow without bound in subsequent runs. In this 
particular configuration, however, MapReduce can sustain 
crawl rates in excess of 100% (the dotted line, again, 
extrapolates performance). 

            Bigtable   Percolator   Relative 
Read/s      15513      14590        0.94 
Write/s     31003      7232         0.23 

Figure 8: The overhead of Percolator operations relative to 
Bigtable. Write overhead is due to additional operations 
Percolator needs to check for conflicts. 

These results show that Percolator can process documents 
at orders of magnitude better latency than MapReduce 
in the regime where we expect real systems to operate 
(single-digit crawl rates). 


3.2 Microbenchmarks 


In this section, we determine the cost of the trans- 
actional semantics provided by Percolator. In these ex- 
periments, we compare Percolator to a “raw” Bigtable. 
We are only interested in the relative performance 
of Bigtable and Percolator since any improvement in 
Bigtable performance will translate directly into an im- 
provement in Percolator performance. Figure 8 shows 
the performance of Percolator and raw Bigtable running 
against a single tablet server. All data was in the tablet 
server’s cache during the experiments and Percolator’s 
batching optimizations were disabled. 

As expected, Percolator introduces overhead relative 
to Bigtable. We first measure the number of random 
writes that the two systems can perform. In the case of 
Percolator, we execute transactions that write a single 
cell and then commit; this represents the worst case for 
Percolator overhead. When doing a write, Percolator in- 
curs roughly a factor of four overhead on this benchmark. 
This is the result of the extra operations Percolator re- 
quires for commit beyond the single write that Bigtable 
issues: a read to check for locks, a write to add the lock, 
and a second write to remove the lock record. The read, 
in particular, is more expensive than a write and accounts 
for most of the overhead. In this test, the limiting factor 
was the performance of the tablet server, so the additional 
overhead of fetching timestamps is not measured. 
We also tested random reads: Percolator performs a sin- 
gle Bigtable operation per read, but that read operation 
is somewhat more complex than the raw Bigtable oper- 
ation (the Percolator read looks at metadata columns in 
addition to data columns). 
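
The relative figures in Figure 8 can be recomputed from the raw rates as a quick consistency check. The dictionary layout and variable names are illustrative; the numbers are those reported in the figure.

```python
# Operations per second against a single tablet server (Figure 8).
figure8 = {
    "read": {"bigtable": 15513, "percolator": 14590},
    "write": {"bigtable": 31003, "percolator": 7232},
}

# Percolator throughput as a fraction of raw Bigtable throughput.
relative = {
    op: round(rates["percolator"] / rates["bigtable"], 2)
    for op, rates in figure8.items()
}

# How many times slower a single-cell Percolator write is than a raw write.
write_slowdown = figure8["write"]["bigtable"] / figure8["write"]["percolator"]
```

The write slowdown comes out slightly above 4, in line with the "roughly a factor of four" stated in the text.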


3.3 Synthetic Workload 


To evaluate Percolator on a more realistic work- 
load, we implemented a synthetic benchmark based on 
TPC-E [1]. This isn’t the ideal benchmark for Percola- 
tor since TPC-E is designed for OLTP systems, and a 
number of Percolator’s tradeoffs impact desirable prop- 
erties of OLTP systems (the latency of conflicting trans- 
actions, for example). TPC-E is a widely recognized and 
understood benchmark, however, and it allows us to un- 
derstand the cost of our system against more traditional 
databases. 

TPC-E simulates a brokerage firm with customers who 
perform trades, market search, and account inquiries. 
The brokerage submits trade orders to a market ex- 
change, which executes the trade and updates broker and 
customer state. The benchmark measures the number of 
trades executed. On average, each customer performs a 
trade once every 500 seconds, so the benchmark scales 
by adding customers and associated data. 

TPC-E traditionally has three components — a cus- 
tomer emulator, a market emulator, and a DBMS run- 
ning stored SQL procedures. Since Percolator is a client 
library running against Bigtable, our implementation is 
a combined customer/market emulator that calls into 
the Percolator library to perform operations against 
Bigtable. Percolator provides a low-level Get/Set/iterator 
API rather than a high-level SQL interface, so we created 
indexes and did all the ‘query planning’ by hand. 

Since Percolator is an incremental processing system 
rather than an OLTP system, we don’t attempt to meet the 
TPC-E latency targets. Our average transaction latency is 
2 to 5 seconds, but outliers can take several minutes. Out- 
liers are caused by, for example, exponential backoff on 
conflicts and Bigtable tablet unavailability. Finally, we 
made a small modification to the TPC-E transactions. In 
TPC-E, each trade result increases the broker’s commis- 
sion and increments his trade count. Each broker services 
a hundred customers, so the average broker must be up- 
dated once every 5 seconds, which causes repeated write 
conflicts in Percolator. In Percolator, we would imple- 
ment this feature by writing the increment to a side table 
and periodically aggregating each broker’s increments; 
for the benchmark, we choose to simply omit this write. 
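
The broker hotspot and the side-table alternative can be sketched as follows. The text describes the side table in a single sentence, so the details here are assumptions: `SideTable`, its methods, and the use of a unique trade id as part of the key are hypothetical choices for illustration.

```python
from collections import defaultdict


def broker_update_interval_s(customers_per_broker, seconds_per_trade):
    """Mean time between updates to one broker row."""
    return seconds_per_trade / customers_per_broker


class SideTable:
    """Append-only increments keyed by (broker, unique trade id).

    Each trade writes its own cell, so concurrent trades for the same broker
    never write the same cell and therefore cannot conflict with each other.
    """

    def __init__(self):
        self.rows = {}                       # (broker, trade_id) -> commission
        self.totals = defaultdict(lambda: {"commission": 0.0, "trades": 0})

    def record(self, broker, trade_id, commission):
        self.rows[(broker, trade_id)] = commission

    def aggregate(self):
        """Periodic job: fold pending increments into per-broker totals."""
        pending, self.rows = self.rows, {}
        for (broker, _), commission in pending.items():
            self.totals[broker]["commission"] += commission
            self.totals[broker]["trades"] += 1
        return dict(self.totals)


interval = broker_update_interval_s(100, 500)   # 5.0 seconds between updates
side = SideTable()
side.record("b1", "t1", 2.5)
side.record("b1", "t2", 1.5)
side.record("b2", "t3", 4.0)
totals = side.aggregate()
```

The trade-off is that broker totals lag behind the trades by up to one aggregation period, which is acceptable for an incremental processing system but not for a strict OLTP workload.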

Figure 9 shows how the resource usage of Percolator 
scales as demand increases. We will measure resource 
usage in CPU cores since that is the limiting resource 
in our experimental environment. We were able to procure 
a small number of machines for testing, but our 
test Bigtable cell shares the disk resources of a much 
larger production cluster. As a result, disk bandwidth 
is not a factor in the system’s performance. In this experiment, 
we configured the benchmark with increasing 
numbers of customers and measured both the achieved 
performance and the number of cores used by all parts 
of the system including cores used for background maintenance 
such as Bigtable compactions. The relationship 
between performance and resource usage is essentially 
linear across several orders of magnitude, from 11 cores 
to 15,000 cores. 

Figure 9: Transaction rate on a TPC-E-like benchmark as a function 
of cores used. The dotted line shows linear scaling. 

This experiment also provides an opportunity to mea- 
sure the overheads in Percolator relative to a DBMS. 
The fastest commercial TPC-E system today performs 
3,183 tpsE using a single large shared-memory machine 
with 64 Intel Nehalem cores with 2 hyperthreads per 
core [33]. Our synthetic benchmark based on TPC-E per- 
forms 11,200 tps using 15,000 cores. This comparison 
is very rough: the Nehalem cores in the comparison ma- 
chine are significantly faster than the cores in our test cell 
(small-scale testing on Nehalem processors shows that 
they are 20-30% faster per-thread compared to the cores 
in the test cluster). However, we estimate that Percola- 
tor uses roughly 30 times more CPU per transaction than 
the benchmark system. On a cost-per-transaction basis, 
the gap is likely much less than 30 since our test clus- 
ter uses cheaper, commodity hardware compared to the 
enterprise-class hardware in the reference machine. 
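
The text does not spell out how the estimate of roughly 30 is derived. The sketch below shows one reading that lands near it; counting hardware threads rather than physical cores on the reference machine is an assumption made here, not something the text states. The function and variable names are hypothetical.

```python
def cores_per_tps(cores, tps):
    """CPU cores consumed per transaction per second."""
    return cores / tps


percolator = cores_per_tps(15_000, 11_200)
dbms_physical = cores_per_tps(64, 3_183)        # counting physical cores
dbms_threads = cores_per_tps(64 * 2, 3_183)     # counting hardware threads

ratio_physical = percolator / dbms_physical     # about 67
ratio_threads = percolator / dbms_threads       # about 33

# Discounting the reference machine's reported 20-30% per-thread speed
# advantage narrows the thread-based ratio further.
adjusted = (ratio_threads / 1.3, ratio_threads / 1.2)
```

Counting hardware threads gives a ratio of about 33, and the per-thread speed adjustment brings it to the mid-to-high 20s, both in the neighborhood of "roughly 30"; counting physical cores alone would give about 67.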

The conventional wisdom on implementing databases 
is to “get close to the iron” and use hardware as directly 
as possible since even operating system structures like 
disk caches and schedulers make it hard to implement 
an efficient database [32]. In Percolator we not only interposed 
an operating system between our database and 
the hardware, but also several layers of software and network 
links. The conventional wisdom is correct: this arrangement 
has a cost. There are substantial overheads in 
preparing requests to go on the wire, sending them, and 
processing them on a remote machine. To illustrate these 
overheads in Percolator, consider the act of mutating the 
database. In a DBMS, this incurs a function call to store 
the data in memory and a system call to force the log to 
a hardware-controlled RAID array. In Percolator, a client 
performing a transaction commit sends multiple RPCs 
to Bigtable, which commits the mutation by logging it 
to 3 chunkservers, which make system calls to actually 
flush the data to disk. Later, that same data will be compacted 
into minor and major SSTables, each of which will 
be again replicated to multiple chunkservers. 

Figure 10: Recovery of tps after 33% tablet server mortality 

The CPU inflation factor is the cost of our layering. 
In exchange, we get scalability (our fastest result, though 
not directly comparable to TPC-E, is more than 3x the 
current official record [33]), and we inherit the useful 
features of the systems we build upon, like resilience to 
failures. To demonstrate the latter, we ran the benchmark 
with 15 tablet servers and allowed the performance to 
stabilize. Figure 10 shows the performance of the system 
over time. The dip in performance at 17:09 corresponds 
to a failure event: we killed a third of the tablet servers. 
Performance drops immediately after the failure event 
but recovers as the tablets are reloaded by other tablet 
servers. We allowed the killed tablet servers to restart so 
performance eventually returns to the original level. 


4 Related Work 


Batch processing systems like MapReduce [13, 22, 
24] are well suited for efficiently transforming or analyz- 
ing an entire corpus: these systems can simultaneously 
use a large number of machines to process huge amounts 
of data quickly. Despite this scalability, re-running a 
MapReduce pipeline on each small batch of updates re- 
sults in unacceptable latency and wasted work. Over- 
lapping or pipelining the adjacent stages can reduce la- 
tency [10], but straggler shards still set the minimum 
time to complete the pipeline. Percolator avoids the ex- 
pense of repeated scans by, essentially, creating indexes 
on the keys used to cluster documents; one of the criticisms 
leveled by Stonebraker and DeWitt in their initial critique 
of MapReduce [16] was that MapReduce did not support 
such indexes. 

Several proposed modifications to MapReduce [18, 
26, 35] reduce the cost of processing changes to a reposi- 
tory by allowing workers to randomly read a base repos- 
itory while mapping over only newly arrived work. To 
implement clustering in these systems, we would likely 
maintain a repository per clustering phase. Avoiding the 
need to re-map the entire repository would allow us to 
make batches smaller, reducing latency. DryadInc [31] 
attacks the same problem by reusing identical portions 
of the computation from previous runs and allowing the 
user to specify a merge function that combines new in- 
put with previous iterations’ outputs. These systems rep- 
resent a middle-ground between mapping over the en- 
tire repository using MapReduce and processing a single 
document at a time with Percolator. 

Databases satisfy many of the requirements of an 
incremental system: an RDBMS can make many independent 
and concurrent changes to a large corpus and 
provides a flexible language for expressing computa- 
tion (SQL). In fact, Percolator presents the user with 
a database-like interface: it supports transactions, itera- 
tors, and secondary indexes. While Percolator provides 
distributed transactions, it is by no means a full-fledged 
DBMS: it lacks a query language, for example, as well 
as full relational operations such as join. Percolator is 
also designed to operate at much larger scales than exist- 
ing parallel databases and to deal better with failed ma- 
chines. Unlike Percolator, database systems tend to em- 
phasize latency over throughput since a human is often 
waiting for the results of a database query. 

The organization of data in Percolator mirrors that 
of shared-nothing parallel databases [7, 15, 4]. Data is 
distributed across a number of commodity machines in 
shared-nothing fashion: the machines communicate only 
via explicit RPCs; no shared memory or shared disks are 
used. Data stored by Percolator is partitioned by Bigtable 
into tablets of contiguous rows which are distributed 
among machines; this mirrors the declustering performed 
by parallel databases. 

The transaction management of Percolator builds on a 
long line of work on distributed transactions for database 
systems. Percolator implements snapshot isolation [5] by 
extending multi-version timestamp ordering [6] across a 
distributed system using two-phase commit. 

An analogy can be drawn between the role of ob- 
servers in Percolator to incrementally move the system 
towards a “clean” state and the incremental maintenance 
of materialized views in traditional databases (see Gupta 
and Mumick [21] for a survey of the field). In practice, 
while some indexing tasks like clustering documents by 
contents could be expressed in a form appropriate for incremental 
view maintenance, it would likely be hard to 
express the transformation of a raw document into an indexed 
document in such a form. 


The utility of parallel databases and, by extension, 
a system like Percolator, has been questioned several 
times [17] over their history. Hardware trends have, in 
the past, worked against parallel databases. CPUs have 
become so much faster than disks that a few CPUs in 
a shared-memory machine can drive enough disk heads 
to service required loads without the complexity of dis- 
tributed transactions: the top TPC-E benchmark results 
today are achieved on large shared-memory machines 
connected to a SAN. This trend is beginning to reverse 
itself, however, as the enormous datasets like those Per- 
colator is intended to process become far too large for a 
single shared-memory machine to handle. These datasets 
require a distributed solution that can scale to 1000s of 
machines, while existing parallel databases can utilize 
only 100s of machines [30]. Percolator provides a sys- 
tem that is scalable enough for Internet-sized datasets by 
sacrificing some (but not all) of the flexibility and low- 
latency of parallel databases. 

Distributed storage systems like Bigtable have the 
scalability and fault-tolerance properties of MapReduce 
but provide a more natural abstraction for storing a repos- 
itory. Using a distributed storage system allows for low- 
latency updates since the system can change state by mu- 
tating the repository rather than rewriting it. However, 
Percolator is a data transformation system, not only a 
data storage system: it provides a way to structure com- 
putation to transform that data. In contrast, systems like 
Dynamo [14], Bigtable, and PNUTS [11] provide highly 
available data storage without the attendant mechanisms 
of transformation. These systems can also be grouped 
with the NoSQL databases (MongoDB [27], to name one 
of many): both offer higher performance and scale better 
than traditional databases, but provide weaker semantics. 


Percolator extends Bigtable with multi-row, dis- 
tributed transactions, and it provides the observer inter- 
face to allow applications to be structured around notifi- 
cations of changed data. We considered building the new 
indexing system directly on Bigtable, but the complexity 
of reasoning about concurrent state modification without 
the aid of strong consistency was daunting. Percolator 
does not inherit all of Bigtable’s features: it has limited 
support for replication of tables across data centers, for 
example. Since Bigtable’s cross data center replication 
strategy is consistent only on a per-tablet basis, replica- 
tion is likely to break invariants between writes in a dis- 
tributed transaction. Unlike Dynamo and PNUTS which 
serve responses to users, Percolator is willing to accept 
the lower availability of a single data center in return for 
stricter consistency. 


Several research systems have, like Percolator, ex- 
tended distributed storage systems to include strong con- 
sistency. Sinfonia [3] provides a transactional interface 
to a distributed repository. Earlier published versions of 
Sinfonia [2] also offered a notification mechanism similar 
to Percolator’s observer model. Sinfonia and Percolator 
differ in their intended use: Sinfonia is designed 
to build distributed infrastructure while Percolator is in- 
tended to be used directly by applications (this probably 
explains why Sinfonia’s authors dropped its notification 
mechanism). Additionally, Sinfonia’s mini-transactions 
have limited semantics compared to the transactions pro- 
vided by RDBMSs or Percolator: the user must specify 
a list of items to compare, read, and write prior to issu- 
ing the transaction. The mini-transactions are sufficient 
to create a wide variety of infrastructure but could be 
limiting for application builders. 

CloudTPS [34], like Percolator, builds an ACID-compliant
datastore on top of a distributed storage system
(HBase [23] or Bigtable). Percolator and CloudTPS
differ in design, however: transaction management
in CloudTPS is handled by an intermediate
layer of servers called local transaction managers
that cache mutations before they are persisted to the
underlying distributed storage system. By contrast, Percolator
uses clients, directly communicating with Bigtable,
to coordinate transaction management. The focus of the
systems is also different: CloudTPS is intended to be a
backend for a website and, as such, has a stronger focus
on latency and partition tolerance than Percolator.

ElasTraS [12], a transactional data store, is architec- 
turally similar to Percolator; the Owning Transaction 
Managers in ElasTraS are essentially tablet servers. Un- 
like Percolator, ElasTraS offers limited transactional se- 
mantics (Sinfonia-like mini-transactions) when dynami- 
cally partitioning the dataset and has no support for struc- 
turing computation. 


5 Conclusion and Future Work 


We have built and deployed Percolator and it has been 
used to produce Google’s websearch index since April
2010. The system achieved the goals we set for reducing
the latency of indexing a single document with an accept- 
able increase in resource usage compared to the previous 
indexing system. 

The TPC-E results suggest a promising direction for 
future investigation. We chose an architecture that scales 
linearly over many orders of magnitude on commodity 
machines, but we’ve seen that this costs a significant 30- 
fold overhead compared to traditional database architec- 
tures. We are very interested in exploring this tradeoff 
and characterizing the nature of this overhead: how much 
is fundamental to distributed storage systems, and how 
much can be optimized away? 

Abstract: Experience from an operational Map-Reduce
cluster reveals that outliers significantly prolong job com- 
pletion. The causes for outliers include run-time con- 
tention for processor, memory and other resources, disk 
failures, varying bandwidth and congestion along network
paths, and imbalance in task workload. We present
Mantri, a system that monitors tasks and culls outliers us- 
ing cause- and resource-aware techniques. Mantri’s strate- 
gies include restarting outliers, network-aware placement 
of tasks and protecting outputs of valuable tasks. Using 
real-time progress reports, Mantri detects and acts on out- 
liers early in their lifetime. Early action frees up resources 
that can be used by subsequent tasks and expedites the job 
overall. Acting based on the causes and the resource and 
opportunity cost of actions lets Mantri improve over prior 
work that only duplicates the laggards. Deployment in 
Bing’s production clusters and trace-driven simulations 
show that Mantri improves job completion times by 32%. 


1 Introduction 


In a very short time, Map-Reduce has become the domi- 
nant paradigm for large data processing on compute clus- 
ters. Software frameworks based on Map-Reduce [1, 11, 
13] have been deployed on tens of thousands of machines 
to implement a variety of applications, such as building 
search indices, optimizing advertisements, and mining 
social networks. 

While highly successful, Map-Reduce clusters come 
with their own set of challenges. One such challenge is 
the often unpredictable performance of the Map-Reduce 
jobs. A job consists of a set of tasks which are organized in 
phases. Tasks in a phase depend on the results computed 
by the tasks in the previous phase and can run in paral- 
lel. When a task takes longer to finish than other similar 
tasks, tasks in the subsequent phase are delayed. At key 
points in the job, a few such outlier tasks can prevent the 
rest of the job from making progress. As the size of the 
cluster and the size of the jobs grow, the impact of outliers 
increases dramatically. Addressing the outlier problem is 
critical to speed up job completion and improve cluster 
efficiency. 

Even a few percent of improvement in the efficiency
of a cluster consisting of tens of thousands of nodes can
save millions of dollars a year. In addition, finishing
production jobs quickly is a competitive advantage. Doing
so predictably allows SLAs to be met. In iterative
modify/debug/analyze development cycles, the ability to
iterate faster improves programmer productivity.


In this paper, we characterize the impact and causes 
of outliers by measuring a large Map-Reduce production 
cluster. This cluster is up to two orders of magnitude 
larger than those in previous publications [1, 13, 20] and 
exhibits a high level of concurrency due to many jobs si- 
multaneously running on the cluster and many tasks on 
a machine. We find that variation in completion times 
among functionally similar tasks is large and that outliers 
inflate the completion time of jobs by 34% at median. 

We identify three categories of root causes for outliers 
that are induced by the interplay between storage, net- 
work and structure of Map-Reduce jobs. First, machine 
characteristics play a key role in the performance of tasks. 
These include static aspects such as hardware reliabil- 
ity (e.g., disk failures) and dynamic aspects such as con- 
tention for processor, memory and other resources. Sec- 
ond, network characteristics impact the data transfer rates 
of tasks. Datacenter networks are over-subscribed, leading
to variance in congestion among different paths. Finally,
the specifics of Map-Reduce lead to imbalance in work:
partitioning data over a low-entropy key space often leads
to a skew in the input sizes of tasks.


We present Mantri¹, a system that monitors tasks and
culls outliers based on their causes. It uses the follow- 
ing techniques: (i) Restarting outlier tasks cognizant of 
resource constraints and work imbalances, (ii) Network- 
aware placement of tasks, and (iii) Protecting output of 
tasks based on a cost-benefit analysis. 

The detailed analysis and decision process employed by 
Mantri is a key departure from the state-of-the-art for out- 
lier mitigation in Map-Reduce implementations [11, 13, 
20]; these focus only on duplicating tasks. To our knowledge,
none of them protect against recomputations induced by
data loss or outliers induced by network congestion.
Mantri places tasks based on the locations of their data 
sources as well as the current utilization of network links. 
On a task’s completion, Mantri replicates its output if the
benefit of not having to recompute outweighs the cost of
replication.

¹From Sanskrit, a minister who keeps the king’s court in order.

Further, Mantri performs intelligent restarting of out- 
liers. A task that runs for long because it has more work 
to do will not be restarted; if it lags due to reading data 
over a low-bandwidth path, it will be restarted only if a 
more advantageous network location becomes available. 
Unlike current approaches that duplicate tasks only at the 
end of a phase, Mantri uses real-time progress reports to 
act early. While early action on outliers frees up resources 
that could be used for pending tasks, doing so is nontriv- 
ial. A duplicate may finish faster than the original task 
but has the opportunity cost of consuming resources that 
other pending work could have used. 


In summary, we make the following contributions.
First, we provide an analysis of the causes of outliers in
a large production Map-Reduce cluster. Second, we develop
Mantri, a system that takes early actions based on
understanding the causes and the opportunity cost of actions.
Finally, we perform an extensive evaluation of Mantri and 
compare it to existing solutions. 

Mantri has run live in all of Bing’s production clusters since
May 2010. Results from a deployment of Mantri on a pro- 
duction cluster of thousands of servers and from replay- 
ing several thousand jobs collected on this cluster in a 
simulator show that: 
• Mantri reduces the completion time of jobs by 32% on
  average on the production clusters. Extensive simulations
  show that job phases are quicker by 21% and 42%
  at the 50th and 75th percentiles. Mantri’s median reduction
  in completion time improves on the next best
  scheme by 3.1x while using fewer resources.

• By placing reduce tasks to avoid network hotspots,
  Mantri improves the completion times of the reduce
  phases by 60%.

• By preferentially replicating the output of tasks that are
  more likely to be lost or expensive to recompute, Mantri
  speeds up half of the jobs by at least 20% each while
  only increasing the network traffic by 1%.


2 Background 


We monitored the cluster and software systems that sup- 
port the Bing search engine for over twelve months. This 
is a cluster of tens of thousands of commodity servers 
managed by Cosmos [8], a proprietary upgraded form of 
Dryad [13]. Despite a few differences, implementations 
of Map-Reduce [1, 8, 11, 13] are broadly similar. 

Most of the jobs in the examined cluster are written 
in Scope [8], a mash-up language that mixes SQL-like 
declarative statements with user code. The Scope compiler
transforms a job into a workflow, a directed acyclic
graph where each node is a phase and each edge joins a
phase that produces data to another that uses it. A phase
is a set of one or more tasks that run in parallel and perform
the same computation on different parts of the input
stream. Typical phases are map, reduce and join. The
number of tasks in a phase is chosen at compile time. A 
task will read its input over the network if it is not avail- 
able on the local disk but outputs are written to the local 
disk. The eventual outputs of a job (as well as raw data) are 
stored in a reliable block storage system implemented on 
the same servers that do computation. Blocks are repli- 
cated n-ways for reliability. A run-time scheduler assigns 
tasks to machines, based on data locations, dependence 
patterns and cluster-wide resource availability. The net- 
work layout provides more bandwidth within a rack than 
across racks. 

We obtain detailed logs from the Scope compiler and 
the Cosmos scheduler. At each of the job, phase and task 
levels, we record the execution behavior as represented 
by begin and end times, the machine(s) involved, the
sizes of input and output data, the fraction of data that 
was read across racks and a code denoting the success or 
type of failure. We also record the workflow of jobs. Ta- 
ble 1 depicts the random subset of logs that we analyze 
here. Spanning eighteen days, this dataset is at least one 
order of magnitude larger than prior published data along 
many dimensions, e.g., number of jobs, cluster size. 


3 The Outlier Problem 


We begin with a first-principles approach to the outlier
problem, then analyze data from the production cluster
to quantify the problem and obtain a breakdown of the
causes of outliers (§4). Beginning at first principles
motivates a distinct approach (§5) which, as we show in §6,
significantly improves on prior art.


3.1 Outliers in a Phase 


Assume a phase consists of $n$ tasks and has $s$ slots. A slot is a
virtual token, akin to a quota, for sharing cluster resources
among multiple jobs. One task can run per slot at a time.
On our cluster, the median ratio of $n/s$ is 2.11 with a stdev
of 12.37. The goal is to minimize the phase completion
time, i.e., the time when the last task finishes.

Based on data from the production cluster, we model
$t_i$, the completion time of task $i$, as a function of the size
of the data it processes, the code it runs, the resources
available on the machine it executes and the bandwidth
available on the network paths involved:

$$t_i = f(\text{datasize}, \text{code}, \text{machine}, \text{network}) \qquad (1)$$


Large variation exists along each of the four variables,
leading to considerable difference in task completion
times. The amount of data processed by tasks in the same
phase varies, sometimes widely, due to limitations in
dividing work evenly. The code is the same for tasks in a
phase, but differs significantly across phases (e.g., map
and reduce). Placing a task on a machine that has other
resource-hungry tasks inflates completion time, as does
reading data across congested links.

May 25,26   19.0    938   49.1   12.6   -66
Jun 16,17   16.5    991   88.0   22.7   132
Jul 20,21   22.0   1183   51.6   14.3   .67
Aug 20,21   29.2   1873   60.6   18.7   -76
Sep 15,16   27.4   1653   73.0   22.8   73
Oct 15,16   20.4   1362   84.1   25.3   .86
Nov 16,17   37.8   1834   88.4   25.0   -68
Dec 10,11   18.7   1777   96.2   18.6   57.2
Jan 11,12   24.4   1842   79.5   21.5   1.99

Table 1: Details of the logs from a production cluster consisting
of thousands of servers.

In the ideal scenario, where every task takes the same
amount of time, say $T$, scheduling is simple. Any
work-conserving schedule would complete the phase in
$\lceil n/s \rceil \times T$. When the task completion time varies,
however, a naive work-conserving scheduler can take up to
$\frac{\sum_n t_i}{s} + \max t_i$. A large variation in $t_i$ increases the
term $\max t_i$ and manifests as outliers.

The goal of a scheduler is to minimize the phase completion
time and make it closer to $\frac{\sum_n t_i}{s}$. Sometimes, it
can do even better. By placing tasks at less congested
machines or network locations, the $t_i$'s themselves can be
lowered. The challenge lies in recognizing the aspects that
can be changed and scheduling accordingly.
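
To make the bounds above concrete, the following minimal Python sketch simulates a greedy work-conserving scheduler and compares its completion time with $\sum_n t_i / s$ and $\sum_n t_i / s + \max t_i$. The function name `greedy_phase_time` and the task durations are hypothetical and chosen only for illustration; they do not come from the production traces.

```python
import heapq


def greedy_phase_time(task_times, slots):
    """Phase completion time when each task, in the given order,
    is started on whichever slot frees up first (work-conserving)."""
    free_at = [0.0] * slots              # time at which each slot becomes free
    heapq.heapify(free_at)
    finish = 0.0
    for t in task_times:
        start = heapq.heappop(free_at)   # earliest available slot
        end = start + t
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


# Hypothetical phase: six unit-length tasks and one 5x outlier, two slots.
times = [1, 1, 1, 1, 1, 1, 5]
s = 2
lower = sum(times) / s                   # sum(t_i) / s
upper = sum(times) / s + max(times)      # sum(t_i) / s + max(t_i)

outlier_last = greedy_phase_time(times, s)
outlier_first = greedy_phase_time(sorted(times, reverse=True), s)
print(lower, outlier_first, outlier_last, upper)
```

In this toy phase, starting the long task last leaves one slot idle while the outlier runs alone, whereas starting it first brings the completion time close to $\sum_n t_i / s$.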





3.2 Extending from a phase to a job 


The phase structure of Map-Reduce jobs adds to the vari- 
ability. An outlier in an early phase, by delaying when 
tasks that use its output may start, has cumulative effects 
on the job. At barriers in the workflow, where none of 
the tasks in successive phase(s) can begin until all of the 
tasks in the preceding phase(s) finish, even one outlier 
can bring the job to a standstill². Barriers occur primarily
due to reduce operations that are neither commutative
nor associative [28], for instance, a reduce that computes
the median of records that have the same key. In
our cluster, the median job workflow has eight phases and
eleven edges, of which 47% are barriers (the number of edges
exceeds the number of phases due to table joins).
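
As a small illustration of why a median forces a barrier, the Python sketch below contrasts a sum, which can be folded together from per-partition partial results in any order, with a median, which in general cannot be derived from per-partition medians and so needs all records for a key to be present. The partition contents are made up for this example.

```python
from statistics import median

# Hypothetical records for one key, split across three upstream tasks.
partitions = [[1, 100, 101], [2, 3], [4]]
all_records = [x for part in partitions for x in part]

# Sum: commutative and associative, so partial results combine correctly.
sum_from_partials = sum(sum(part) for part in partitions)
sum_direct = sum(all_records)

# Median: combining per-partition medians gives a different answer here.
median_from_partials = median(median(part) for part in partitions)
median_direct = median(all_records)

print(sum_from_partials == sum_direct)        # partial sums agree
print(median_from_partials, median_direct)    # the two medians differ
```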

Dependency across phases also leads to outliers when
task output is lost and needs to be recomputed. Data loss
happens due to a combination of disk errors, software
errors (e.g., bugs in garbage collectors) and timeouts due to
machines going unresponsive at times of high load. In
fact, recomputes cause some of the longest waiting times
observed on the production cluster. A recompute can cascade
into earlier phases if the inputs for the recomputed
task are no longer available and need to be regenerated.

²There is a variant in implementation where a slot is reserved for
a task before all its inputs are ready. This is either to amortize the
latency of network transfer by moving data over the network as soon
as it is generated [1, 11], or to compute partial results and present answers
online even before the job is complete [9]. Regardless, pre-allocation
of slots hogs resources for longer periods if the input task(s) straggle.

Figure 1: An example job from the production cluster


3.3 Illustration of Outliers 


Figure 1(a) shows the workflow for a job whose structure 
is typical of those in the cluster. The job reads a dataset 
of search usage and derives an index. It consists of two 
Map-Reduce operations and a join, but for clarity we only 
show the first Map-Reduce here. Phase names follow the 
Dryad [13] convention: extract reads raw blocks, partition
divides data on the key and aggregate reduces items
that share a key.

Figure 1(b) depicts a timeline of an execution of this 
workflow. It plots the number of tasks of each phase that 
are active, normalized by the maximum tasks active at any 
time in that phase, over the lifetime of the job. Tasks in 
the first two phases start in quick succession
at x~.05, whereas the third starts after a barrier.

Some of the outliers are evident in the long lulls before 
a phase ends when only a few of its tasks are active. In par- 
ticular, note the regions before x~.1 and x~.5. The spike 
in phase #2 here is due to the outliers in phase #1 holding 
on to the job's slots. At the barrier, x~.1, just a few outliers 
hold back the job from making forward progress. Though 
most aggregate tasks finish at x~.3, the phase persists for 
another 20%. 

The worst cases of waiting immediately follow recomputations
of lost intermediate data, marked by R. Recomputations
manifest as tiny blips near the x axis for phases
that had finished earlier, e.g., phase #2 sees recomputes at
x~.2 though it finished at x~.1. At x~.2, note that aggregate
almost stops due to a few recomputations.

We now quantify the magnitude of the outlier problem,
before presenting our solution in detail. 


4 Quantifying the Outlier Problem 


We characterize the prevalence and causes of outliers and 
their impact on job completion times and cluster resource 
usage. We will argue that three factors (dynamics, concurrency
and scale), which are somewhat unique to large
Map-Reduce clusters run for efficient and economic operation,
lie at the core of the outlier problem. To our knowledge,
we are the first to report detailed experiences from
a large production Map-Reduce cluster. 


4.1 Prevalence of Outliers 


Figure 2(a) plots the fraction of high runtime outliers and 
recomputes in a phase. For exposition, we arbitrarily say 
that a task has high runtime if its time to finish is longer 
than 1.5x the median task duration in its phase. By re- 
computes, we mean instances where a task output is lost 
and dependent tasks wait until the output is regenerated. 
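
The high-runtime definition above (longer than 1.5x the median task duration in the phase) can be written down directly. The following is a minimal sketch; the function name `high_runtime_fraction` and the sample durations are hypothetical.

```python
from statistics import median


def high_runtime_fraction(durations, threshold=1.5):
    """Fraction of tasks in a phase whose duration exceeds
    `threshold` times the phase's median task duration."""
    cutoff = threshold * median(durations)
    slow = [d for d in durations if d > cutoff]
    return len(slow) / len(durations)


# Hypothetical task durations (seconds) for one phase.
phase = [10, 11, 9, 10, 12, 30, 10, 16]
print(high_runtime_fraction(phase))
```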

We see in Figure 2(a) that 25% of phases have more 
than 15% of their tasks as outliers. The figure also shows 
that 99% of the phases see no recomputes. Though rare, 
recomputes have a widespread impact (§4.3). Two out of
a thousand phases have over 50% of their tasks waiting for 
data to be recomputed. 

How much longer do outliers run for? Figure 2(b) 
shows that 80% of the runtime outliers last less than 2.5 
times the phase’s median task duration, with a uniform
probability of being delayed by between 1.5x and 2.5x. The
tail is heavy and long: 10% take more than 10x the median
duration. Ignoring these if they happen early in a
phase, as current approaches do, appears wasteful. 

Figure 2(b) shows that most recomputations behave
normally: 90% of them are clustered about the median
task, but 3% take over 10x longer.


4.2 Causes of Outliers 


To tease apart the contributions of each cause, we first de- 
termine whether a task’s runtime can be explained by the 
amount of data it processes or reads across the network³.
If yes, then the outlier is likely due to workload imbalance 
or poor placement. Otherwise, the outlier is likely due to 
resource contention or problematic machines. 

Figure 3(a) shows that in 40% of the phases (top right),
all the tasks with high runtimes (i.e., over 1.5x the median
task) are well explained by the amount of data they
process or move on the network. Duplicating these tasks
would not make them run faster and will waste resources.
At the other extreme, in 18% of the phases (bottom left),
none of the high runtime tasks are explained by the data
they process. Figure 3(b) shows tasks that take longer
than they should, as predicted by the model, but do not
take over 1.5x the median task in their phase. Such tasks
present an opportunity for improvement. They may finish
faster if run elsewhere, yet current schemes do nothing
for them. 20% of the phases (on the top right) have over
55% of such improvable tasks.

³For each phase, we fit a linear regression model for task lifetime
given the size of input and the volume of traffic moved across low
bandwidth links. When the residual error for a task is less than 20%,
i.e., its run time is within [.8, 1.2]x of the time predicted by this model,
we call it explainable.

Figure 2: Prevalence of Outliers.


Data Skew: It is natural to ask why data size varies across 
tasks in a phase. Across phases, the coefficient of variation
(stdev/mean) in data size is .34 and 3.1 at the 50th and
90th percentiles. From experience, dividing work evenly
is non-trivial for a few reasons. First, scheduling each ad- 
ditional task has overhead at the job manager. Network 
bandwidth is another reason. There might be too much 
data on a machine for a task to process, but it may be 
worse to split the work into multiple tasks and move data 
over the network. A third reason is poor coding practice. 
If the data is partitioned on a key space that has too little 
entropy, i.e., a few keys correspond to a lot of data, then 
the partitions will differ in size. Some reduce tasks are not 
amenable to splitting (neither commutative nor associa- 
tive [27]), and hence each partition has to be processed by 
one task. Some joins and sorts are similarly constrained. 
Duplicating tasks that run for long because they have a lot 
of work to do is counter-productive. 
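
The effect of a low-entropy key space on partition sizes can be illustrated with a short Python sketch that computes the coefficient of variation (stdev/mean) of partition sizes. The partitioning rule (integer key modulo the number of partitions) and the record counts are assumptions made for this example only.

```python
from collections import Counter
from statistics import mean, pstdev


def coefficient_of_variation(sizes):
    """stdev / mean of the partition sizes."""
    return pstdev(sizes) / mean(sizes)


def partition_sizes(keys, num_partitions):
    """Records per partition when every record of a key must be
    processed by the same task (integer keys, modulo placement)."""
    sizes = [0] * num_partitions
    for key, count in Counter(keys).items():
        sizes[key % num_partitions] += count
    return sizes


# High-entropy key space: 1000 records over 1000 distinct keys.
uniform = partition_sizes(range(1000), 10)
# Low-entropy key space: key 0 alone accounts for 900 of 1000 records.
skewed = partition_sizes([0] * 900 + list(range(1, 101)), 10)

print(coefficient_of_variation(uniform), coefficient_of_variation(skewed))
```

The task that receives the heavy key runs long because it has more work, which is why duplicating it would not help.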


Crossrack Traffic: Reduce phases contribute over 70% 
of the cross rack traffic in the cluster, while most of the 
rest is due to joins. We focus on cross rack traffic because 
the links upstream of the racks have less bandwidth than 
the cumulative capacity of servers in the rack. 

We find that crossrack traffic leads to outliers in two 
ways. First, in phases where moving data across racks is 
avoidable (through locality constraints), a task that ends 
up in a disadvantageous network location runs slower 
than others. Second, in phases where moving data across 
racks is unavoidable, not accounting for the competition 
among tasks within the phase (self-interference) leads to 
outliers. In a reduce phase, for example, each task reads
from every map task. Since the maps are spread across the
cluster, regardless of where a reduce task is placed, it will
read a lot of data from other racks. Current implementations
place reduce tasks on any machine with spare slots.
A rack that has too many reduce tasks will be congested
on its downlink, leading to outliers.

Figure 4: For reduce phases, the reduction in completion
time over the current placement by placing tasks in a
network-aware fashion.

Figure 5: The ratio of processor and memory usage when
recomputations happen to the average at that machine (y1).
Also, the cumulative percentage of recomputations across
machines (y2).

Figure 4 compares the current placement with an ideal 
one that minimizes the impact of network transfer. When 
possible it avoids reading data across racks and if not, 
places tasks such that their competition for bandwidth 
does not result in hotspots. In over 50% of the jobs, reduce 
phases account for 17% of the job’s lifetime. For the re- 
duce phases, the figure shows that the median phase takes 
62% longer under the current placement. 
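
A back-of-the-envelope model shows why packing reduce tasks into one rack creates a downlink hotspot. The sketch below is not Mantri's placement algorithm (that is described in §5.2); it assumes, purely for illustration, that map output is spread evenly over the racks, that all racks have the same downlink bandwidth, and that the function and parameter names are hypothetical.

```python
def reduce_transfer_time(reduces_per_rack, data_per_reduce, downlink_bw):
    """Time for the most loaded rack to pull its reduce inputs across
    its downlink, assuming map output is spread evenly over racks so
    that each reduce reads (R - 1) / R of its input from other racks."""
    num_racks = len(reduces_per_rack)
    cross_rack_fraction = (num_racks - 1) / num_racks
    loads = [n * data_per_reduce * cross_rack_fraction
             for n in reduces_per_rack]
    return max(loads) / downlink_bw


# Twelve reduce tasks on four racks, 4 units of input each, unit bandwidth.
packed = reduce_transfer_time([9, 1, 1, 1], data_per_reduce=4, downlink_bw=1)
balanced = reduce_transfer_time([3, 3, 3, 3], data_per_reduce=4, downlink_bw=1)
print(packed, balanced)
```

Under these assumptions the packed placement is limited by the single congested downlink, while the balanced placement spreads the same total traffic over all four.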


Bad and Busy Machines: We rarely find machines that 
persistently inflate runtimes. Recomputations, however, 
are more localized. Half of them happen on 5% of the machines
in the cluster. Figure 5 plots the cumulative share
of recomputes across machines on the right-hand axis.
The figure also plots the ratio of processor and memory
utilization during recomputes to the overall average on
that machine. The occurrence of recomputes is correlated
with increased use of resources by at least 20%. The subset
of machines that triggers most of the recomputes is
steady over days but varies over weeks, likely indicative
of changing hotspots in data popularity or corruption in
disks [7].

Figure 6: Clustering recomputations and outliers.

Figure 6 investigates the occurrence of “spikes” in out- 
liers. For legibility, we only plot a subset of the machines. 
We find that runtime outliers (shown as stars) cluster by 
time. If outliers were happening at random, there should 
not be any horizontal bands. Rather it appears that jobs 
contend for resources at some times. Even at these busy 
times, other lightly loaded machines exist. Recomputa- 
tions (shown as circles) cluster by machine. When a ma- 
chine loses the output of a task, it has a higher chance of 
losing the output of other tasks. 

Rarely does an entire rack of servers experience the 
same anomaly. When an anomaly happens, the frac- 
tion of other machines within the rack that see the same 
anomaly is less than ~ for recomputes, and _ for run- 
time with high probability. So, it is possible to restart a 
task, or replicate output to protect against loss on another 
machine within the same rack as the original machine. 


4.3 Impact of Outliers 


We now examine the impact of outliers on job comple- 
tion times and cluster usage. Figure 7 plots the CDF for 
the ratio of job completion times, with different types of 
outliers included, to an ideal execution that neither has 
skewed run times nor loses intermediate data. The y-axis
weights each job by the total cluster time its tasks take to
run. The hypothetical scenarios, with some combination 
of outliers present but not the others, do not exist in prac- 
tice. So we replayed the logs in a trace driven simulator 
that retains the structure of the job, the observed task du- 
rations and the probabilities of the various anomalies (de- 
tails in §6). The figure shows that at median, the job com- 
pletion time would be lower by 15% if runtime outliers 
did not happen, and by more than 34% when none of the 
outliers happen. Recomputations impact fewer jobs than 
runtime outliers, but when they do, they delay comple- 
tion time by a larger amount. 

Figure 7: Percentage speed-up of job completion time in the
ideal case when (some combination of) outliers do not occur.

Figure 8: The Outlier Problem: Causes and Solutions


By inducing high variability in repeat runs of the same
job, outliers make it hard to meet SLAs. At median, the
ratio of stdev/mean in job completion time is 0.8, i.e., jobs have a
non-trivial probability of taking twice as long or finishing
in half the time.

To summarize, we take the following lessons from our 
experience. 
• High running times of tasks do not necessarily indicate
  slow execution: there are multiple reasons for legitimate
  variation in durations of tasks.

• Every job is guaranteed some slots, as determined by
  cluster policy, but can use idle slots of other jobs.
  Hence, judicious usage of resources while mitigating
  outliers has collateral benefit.

• Recomputations affect jobs disproportionately. They
  manifest in select faulty machines and during times of
  heavy resource usage. Nonetheless, there are no indications
  of faulty racks.


5 Mantri Design 


Mantri identifies points at which tasks are unable to make 
progress at the normal rate and implements targeted solu- 
tions. The guiding principles that distinguish Mantri from 
prior outlier mitigation schemes are cause awareness and 
resource cognizance. 

Distinct actions are required for different causes. Figure 8
specifies the actions Mantri takes for each cause. If a
task straggles due to contention for resources on the machine,
restarting or duplicating it elsewhere can speed it
up (§5.1). However, avoiding data movement over the
low-bandwidth cross-rack links or, when movement is unavoidable,
steering clear of hotspots requires systematic placement (§5.2).
To speed up tasks that wait for lost input to be recomputed,
we find ways to protect task output (§5.3). Finally,
for tasks with a work imbalance, we schedule the large
tasks before the others to avoid being stuck with the large
ones near completion (§5.4).

Figure 9: A stylized example to illustrate our main ideas. Tasks
that are eventually killed are filled with stripes, repeat instances
of a task are filled with a lighter mesh.

There is a subtle point with outlier mitigation: reducing
the completion time of a task may in fact increase the
job completion time. For example, replicating the output
of every task will drastically reduce recomputations, since both
copies are unlikely to be lost at the same time, but can
slow down the job because more time and bandwidth are
used up for this task, denying resources to other tasks that
are waiting to run. Similarly, addressing outliers early in
a phase vacates slots for outstanding tasks and can speed
up completion, but potentially uses more resources per
task. Unlike Mantri, none of the existing approaches act
early or replicate output. Further, naively extending current
schemes to act early without being cognizant of the
cost of resources, as we show in §6, leads to worse performance.

Closed-loop action allows Mantri to act optimistically 
by bounding the cost when probabilistic predictions go 
awry. For example, even when Mantri cannot ascertain 
the cause of an outlier, it experimentally starts copies. If 
the cause does not repeatedly impact the task, the copy 
can finish faster. To handle the contrary case, Mantri con- 
tinuously monitors running copies and kills those whose 
cost exceeds the benefit. 

Based on task progress reports, Mantri estimates for
each task the remaining time to finish, $t_{rem}$, and the
predicted completion time of a new copy of the task, $t_{new}$.
Tasks report progress once every 10s or ten times in their
lifetime, whichever is smaller. We use $\Delta$ to refer to this
period. We defer details of the estimation to §5.5 and proceed
to describe the algorithms for mitigating each of the
main causes of outliers. All that matters is that $t_{rem}$ be
an accurate estimate and that the predicted distribution
$t_{new}$ account for the underlying work that the task has to
do, the appropriateness of the network location and any
persistent slowness of the new machine.


5.1 Resource-aware Restart 


We begin with a simple example to help exposition. Figure 9
shows a phase that has seven tasks and two slots.
Normal tasks run for times $t$ and $2t$. One outlier has a
runtime of $5t$. Time increases along the x axis.

1: let $\Delta$ = period of progress reports
2: let $c$ = number of copies of a task
3: periodically, for each running task, kill all but the fastest $\alpha$ copies
   after $\Delta$ time has passed since begin
4: while slots are available do
5:     if tasks are waiting for slots then
6:         kill, restart task if $t_{rem} > E(t_{new}) + \Delta$, stop at $\gamma$ restarts
7:         duplicate if $P(t_{rem} > t_{new} \frac{c+1}{c}) > \delta$
8:         start the waiting task that has the largest data to read
9:     else    ▷ all tasks have begun
10:        duplicate iff $E(t_{new} - t_{rem}) > \rho\Delta$
11:    end if
12: end while

Pseudocode 1: Algorithm for Resource-aware restarts (simplified).

The timeline at the top shows a baseline which ignores
outliers and finishes at 7t. Prior approaches that only
address outliers at the end of the phase also finish at 7t.

Note that if this outlier has a large amount of data to
process, letting the straggling task be is better than killing
or duplicating it, both of which waste resources.

If, however, the outlier was slowed down by its
location, the second and third timelines compare duplication
to a restart that kills the original copy. After a short time
to identify the outlier, the scheduler can duplicate it at
the next available slot (the middle timeline) or restart it
in-place (the bottom timeline). If prediction is accurate, 
restarting is strictly better. However, if slots are going idle, 
it may be worthwhile to duplicate rather than incur the 
risk of losing work by killing. 

Duplicating the outlier costs a total of 3t in
resources (2t before the original task is killed and t for the
duplicate), which may be wasteful if the outlier were to
finish sooner than 3t by itself.


Restart Algorithm: Mantri uses two variants of restart:
the first kills a running task and restarts it elsewhere;
the second schedules a duplicate copy. In either method,
Mantri restarts only when the probability of success, i.e.,
P(t_new < t_rem), is high. Since t_new accounts for the
systematic differences and the expected dynamic variation,
Mantri does not restart tasks that are normal (e.g., runtime
proportional to work). Pseudocode 1 summarizes the
algorithm. Mantri kills and restarts a task if its remaining
time is so large that there is a more than even chance that
a restart would finish sooner. In particular, Mantri does so
when t_rem > E(t_new) + Δ ⁴. To not thrash on inaccurate
estimates, Mantri kills a task no more than γ = 3 times.
The “kill and restart” scheme drastically improves the
job completion time without requiring extra slots as we
show analytically in [5]. However, the current job
scheduler incurs a queueing delay before restarting a task, that














⁴Since the median of the heavy-tailed task completion time
distribution is smaller than the mean, this check implies that
P(t_new < t_rem) > P(t_new < E(t_new)) ≥ .5.








(a) Ad-hoc placement 


(b) Even spread 


Figure 10: Three reduce tasks (rhombus boxes) are to be placed 
across three racks. The rectangles indicate their input. The type 
of the rectangle indicates the map that produced this data. Each 
reduce task has to process one shard of each type. The ad-hoc 
placement on the left creates network bottlenecks on the cross- 
rack links (highlighted). Tasks in such racks will straggle. If the 
network has no other traffic, the even placement on the right 
avoids hotspots. 


can be large and highly variant. Hence, we consider 
scheduling duplicates. 


Scheduling a duplicate results in the minimum
completion time of the two copies and provides a safety net
when estimates are noisy or the queueing delay is large.
However, it requires an extra slot and, if allowed to run
to finish, consumes extra computation resource that will
increase the job completion time if outstanding tasks are
prevented from starting. Hence, when there are outstanding
tasks and no spare slots, we schedule a duplicate only
if the total amount of computation resource consumed
decreases. In particular, if c copies of the task are
currently running, a duplicate is scheduled only if
P(t_rem > t_new (c+1)/c) > δ. By default, δ = .25. For example, a
task with one running copy is duplicated only if t_new is
less than half of t_rem. For stability, Mantri does not
re-duplicate a task for which it launched a copy recently. Any
copy that has run for some time and is slower than the
second fastest copy of the task will be killed to conserve
resources. Hence, there are never more than three
running copies of a task⁵. When spare slots are available, as
happens towards the end of the job, Mantri schedules
duplicates more aggressively, i.e., whenever the reduction in
the job completion time is larger than the start up time,
E(t_new − t_rem) > ρΔ. By default, ρ = 3. Note that in all
the above cases, if more than one task satisfies the
necessary conditions, Mantri breaks ties in favor of the task that
will benefit the most.
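
The decision rules of Pseudocode 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Mantri's implementation: t_new is assumed to be available as a list of samples from its predicted distribution, the names `Task` and `decide` are hypothetical, and the periodic pruning of slow copies is left out. The spare-slot rule is written as an expected saving, t_rem minus the mean of t_new, following the prose description of "the reduction in the job completion time".

```python
from dataclasses import dataclass

DELTA = 10.0        # progress-report period in seconds (Δ in the text)
GAMMA = 3           # maximum number of kills per task (γ)
DUP_THRESHOLD = 0.25  # duplication threshold (δ)
RHO = 3             # factor applied to Δ when slots are spare (ρ)


@dataclass
class Task:
    t_rem: float      # estimated remaining time of the running copy
    t_new: list       # samples of the predicted run time of a fresh copy
    copies: int = 1   # c, the number of copies currently running
    kills: int = 0    # how many times this task has already been killed


def decide(task, tasks_waiting):
    """Return 'kill_restart', 'duplicate', or 'none' for one running task."""
    expected_new = sum(task.t_new) / len(task.t_new)
    c = task.copies
    if tasks_waiting:
        # Slots are contended: restart only on a clear expected win.
        if task.kills < GAMMA and task.t_rem > expected_new + DELTA:
            return "kill_restart"
        # Empirical estimate of P(t_rem > t_new * (c + 1) / c).
        p = sum(task.t_rem > s * (c + 1) / c for s in task.t_new) / len(task.t_new)
        return "duplicate" if p > DUP_THRESHOLD else "none"
    # Spare slots: duplicate when the expected saving exceeds ρΔ.
    if task.t_rem - expected_new > RHO * DELTA:
        return "duplicate"
    return "none"
```

With one running copy (c = 1) the duplication test reduces to t_rem > 2 t_new, which matches the example in the text.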














Mantri’s restart algorithm is independent of the values
for its parameters. Setting γ to a larger and ρ, δ to a
smaller value trades off the risk of wasteful restarts for
the reward of a larger speed-up. The default values that
are specified here err on the side of caution.

By scheduling duplicates conservatively and pruning 
aggressively, Mantri has a high success rate of its restarts. 
As a result, it reduces completion time and conserves
resources (§6.2).


⁵The two fastest copies and the copy that has recently started.
5.2 Network-Aware Placement


Reduce tasks, as noted before (§4.2), have to read data
across racks. A rack with too many reduce tasks is con- 
gested on its downlink and such tasks will straggle. Fig- 
ure 10 illustrates such a scenario. 

Given the utilization of all the network links and the 
locations of inputs for all the tasks (and jobs) that are 
waiting to run, optimally placing the tasks to minimize 
job completion time is a form of the centralized traffic 
engineering problem [14, 18]. However, obtaining
up-to-date information about network state and coordinating
centrally across all jobs in the cluster are challenging.
Instead, Mantri approximates the optimal placement by 
a local algorithm that does not track bandwidth changes 
nor require co-ordination across jobs. 

With Mantri, each job manager places tasks so as 
to minimize the load on the network and avoid self- 
interference among its tasks. If every job manager takes 
this independent action, network hotspots will not cause 
outliers. Note that the sizes of the map outputs in each 
rack are known to the job manager prior to placing the 
tasks of the subsequent reduce phase. For a reduce phase
with n tasks running on a cluster with r racks, let its input
matrix I_{n,r} specify the size of input in each rack for each
of the tasks⁶. For any placement of reduce tasks to racks,
let the data to be moved out (on the uplink) and read in
(on the downlink) on the i-th rack be d_u^i and d_v^i, and the
corresponding available bandwidths be b_u^i and b_v^i
respectively. For each rack, we compute two terms

    c_{2i-1} = d_u^i / b_u^i   and   c_{2i} = d_v^i / b_v^i.

The first term is the ratio of outgoing traffic
and available uplink bandwidth, and the second term is
the ratio of incoming traffic and available downlink
bandwidth. The algorithm computes the optimal value over all
placement permutations, i.e., the rack location for each
task that minimizes the maximum data transfer time, as

    \arg\min \max_j c_j,   j = 1, \dots, 2n.
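
The min-max objective can be illustrated with a small exhaustive search. The sketch below is an assumption-laden toy, not the algorithm Mantri runs: it ignores slot limits per rack, enumerates every task-to-rack assignment (exponential in the number of tasks), and the function names are hypothetical. It returns the uplink ratios followed by the downlink ratios rather than interleaving them, which does not affect the maximum.

```python
from itertools import product


def transfer_terms(I, placement, up_bw, down_bw):
    """Return the per-rack ratios c_j for one placement.

    I[t][k]      -- size of task t's input that is stored in rack k
    placement[t] -- rack chosen for task t
    up_bw[i], down_bw[i] -- available uplink/downlink bandwidth of rack i
    """
    r = len(up_bw)
    out = [0.0] * r   # data leaving each rack on its uplink
    inn = [0.0] * r   # data entering each rack on its downlink
    for t, rack in enumerate(placement):
        for k in range(r):
            if k != rack:             # input already in the task's rack stays local
                out[k] += I[t][k]
                inn[rack] += I[t][k]
    uplink = [out[i] / up_bw[i] for i in range(r)]
    downlink = [inn[i] / down_bw[i] for i in range(r)]
    return uplink + downlink


def best_placement(I, up_bw, down_bw):
    """Pick the placement that minimizes the largest transfer ratio."""
    n, r = len(I), len(up_bw)
    return min(product(range(r), repeat=n),
               key=lambda p: max(transfer_terms(I, p, up_bw, down_bw)))
```

For example, with two tasks whose inputs sit entirely in different racks, `best_placement([[10, 0], [0, 10]], [1.0, 1.0], [1.0, 1.0])` keeps each task next to its data, so nothing crosses a rack boundary.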

Rather than track the available bandwidths b_u^i and b_v^i
as they change with time and as a function of other jobs
in the cluster, Mantri uses the following estimates. Reduce phases
with a small amount of data finish quickly, and the band- 
widths can be assumed to be constant throughout the ex- 
ecution of the phase. For phases with a large amount of 
data, the bandwidth averaged over their long lifetime can 
be assumed to be equal for all links. We see that with these 
estimates Mantri’s placement comes close to the ideal in 
our experiments (see §6.4). 

For phases other than reduce, Mantri complements the 
Cosmos policy of placing a task close to its data [23]. By 
accounting for the cost of moving data over low-bandwidth
links in t_new, Mantri ensures that no copy is started


⁶In I, the row sum indicates the data to be read by the task, whereas
the column sum indicates the total input present in that rack. 
Figure 11: Avoiding costly recomputations: The cost to redo a
task includes the recursive probability of predecessor tasks hav- 
ing to be re-done (a). Replicating output reduces the effective 
probability of loss (b). Tasks with many-to-one input patterns 
have high recomputation cost and are more valuable (c). 


at a location where it has little chance of finishing earlier,
thereby not wasting resources.


5.3 Avoiding Recomputation 


To mitigate costly recomputations that stall a job, Mantri 
protects against interim data loss by replicating task out- 
put. It acts early by replicating those outputs whose cost to 
recompute exceeds the cost to replicate. Mantri estimates 
the cost to recompute as the product of the probability 
that the output will be lost and the time to repeat the task. 
The probability of loss is estimated for a machine over a
long period of time. The time to repeat the task is t_redo
with a recursive adjustment that accounts for the task’s
inputs also being lost. Figure 11 illustrates the
calculation of t_redo based on the data loss probabilities (r_i’s), the
time taken by the tasks (t_i’s) and recursively looks at prior
phases. Replicating the output reduces the likelihood of
recomputation to the case when all replicas are
unavailable. If a task reads input from many tasks (e.g., a reduce),
t_redo is higher since any of the inputs needing to be
recomputed will stall the task’s recomputation⁷. The cost
to replicate, t_rep, is the time to move the data to another
machine in the rack.

In effect, the algorithm replicates task output at key places in
a job’s workflow: when the cumulative cost of not
replicating many successive tasks builds up, when tasks ran
on very flaky machines (high r_i), or when the output is so
small that replicating it would cost little (low t_rep).

Further, to avoid excessive replication, Mantri limits the 
amount of data replicated to 10% of the data processed by 
the job. This limit is implemented by granting tokens pro- 
portional to the amount of data processed by each task. 
Task output that satisfies the above cost-benefit check is 


⁷In Fig. 11(c), we assume that if multiple inputs are lost, they are
recomputed in parallel and the task is stalled by the longest input. Since
recomputes are rare (Fig. 2(a)), this is a fair approximation of practice.
replicated only if an equal number of tokens are available.
Tokens are deducted on replication. 
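
The cost-benefit check and the token budget can be sketched as follows. The text does not reproduce the exact formula of Figure 11, so the recursion here is only one plausible reading: it assumes that replicas fail independently with the same probability, that inputs lost together are recomputed in parallel (as in the footnote), and that the tokens required equal the output size. All names are hypothetical.

```python
REPLICATION_BUDGET = 0.10   # replicate at most 10% of the data processed


def redo_time(task):
    """Time to regenerate a task's output, including lost inputs (assumed model).

    task is a dict with keys:
      "t"        -- run time of the task
      "r"        -- loss probability of the machine holding its output
      "replicas" -- number of stored copies of the output (>= 1)
      "inputs"   -- list of upstream task dicts
    """
    # Lost inputs are recomputed in parallel; the slowest one stalls the task.
    stall = max((expected_recompute_cost(i) for i in task["inputs"]), default=0.0)
    return task["t"] + stall


def expected_recompute_cost(task):
    """Probability that every replica is lost times the time to redo the task."""
    p_loss = task["r"] ** task["replicas"]
    return p_loss * redo_time(task)


class TokenBucket:
    """Tokens are granted in proportion to data processed and spent on replication."""

    def __init__(self):
        self.tokens = 0.0

    def grant(self, data_processed):
        self.tokens += REPLICATION_BUDGET * data_processed

    def try_replicate(self, task, t_rep, output_size):
        """Replicate only if it is cheaper than recomputing and the budget allows."""
        if expected_recompute_cost(task) > t_rep and self.tokens >= output_size:
            self.tokens -= output_size
            return True
        return False
```

Under this reading, a reduce task that depends on an unreliable map inherits part of the map's recomputation cost, which is why outputs late in a long chain become more attractive to replicate.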

Mantri proactively recomputes tasks whose output and 
replicas, if any, have been lost. From §4, we see that re- 
computations on a machine cluster by time, hence Mantri 
considers a recompute to be the onset of a temporal prob- 
lem which will cause future requests for data on this ma- 
chine to fail and pre-computes such output. Doing so de- 
creases the time that a dependent task will have to wait 
for lost input to be regenerated. As before, Mantri
imposes a budget on the extra cluster cycles used for
pre-computation. Together, probabilistic replication and
pre-computation approximate the ideal scheme in our
evaluation (§6.5).


5.4 Data-aware Task Ordering 


Workload imbalance causes tasks to straggle. Mantri does 
not restart outliers that take a long time to run because 
they have more work to do. Instead, Mantri improves job 
completion time by scheduling tasks in a phase in
descending order of their input size. Given n tasks, s slots
and input sizes d[1···n], if the optimal completion time
is T_O, scheduling tasks in descending order of their input
sizes will take T, where T / T_O ≤ 4/3 − 1/(3s) [12]. This
means that scheduling tasks with the longest processing time
first is at most 33% worse than the optimal schedule; computing
the optimal is NP-hard [12].
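
A short sketch shows why the ordering matters. It assumes that a task's run time is proportional to its input size and that each task goes to whichever slot frees up first; the function names are illustrative only.

```python
import heapq


def makespan(sizes, slots):
    """Finish time when tasks are started, in the given order, on the earliest free slot."""
    free_at = [0.0] * slots          # time at which each slot becomes free
    heapq.heapify(free_at)
    for size in sizes:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + size)
    return max(free_at)


def largest_first(sizes):
    """Order tasks by descending input size (longest processing time first)."""
    return sorted(sizes, reverse=True)
```

With sizes `[1, 1, 1, 1, 4]` on two slots, running the tasks in the given order leaves the large task for last and finishes at time 6, whereas `largest_first` starts it immediately and finishes at time 4.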


5.5 Estimation of t_rem and t_new


Periodically, every running task informs the job
scheduler of its status, including how many bytes it has read,
d_read, thus far. Mantri combines the progress reports with
the size of the input data that each task has to process,
d, and predicts how much longer the task would take to
finish using this model:


    t_rem = t_elapsed * d / d_read + t_wrapup.    (2)


The first term captures the remaining time to process
data. The second term is the time to compute after all
the input has been read and is estimated from the
behavior of earlier tasks in the phase. Tasks may speed up
or slow down and hence, rather than extrapolating from
each progress report, Mantri uses a moving average. To
be robust against lost progress reports, when a task hasn't
reported for a while, Mantri increases t_rem by assuming
that the task has not progressed since its last report. This
linear model for estimating the remaining time for a task
is well suited for data-intensive computations like
Map-Reduce where a task spends most of its time reading the
input data. We seldom see variance in computation time
among tasks that read equal amounts of data [26].
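
A minimal sketch of this estimator follows, reading Equation (2) as t_rem = t_elapsed * d / d_read + t_wrapup. The window length and the choice to average the per-report estimates (rather than, say, the read rate) are assumptions made for illustration, and the class name is hypothetical.

```python
from collections import deque


class RemainingTimeEstimator:
    """Smooths the linear t_rem model over the most recent progress reports."""

    def __init__(self, d, t_wrapup, window=3):
        self.d = d                          # total input the task must process
        self.t_wrapup = t_wrapup            # compute time after all input is read
        self.recent = deque(maxlen=window)  # estimates from the latest reports

    def report(self, t_elapsed, d_read):
        """Record one progress report (d_read > 0) and return the smoothed t_rem."""
        self.recent.append(t_elapsed * self.d / d_read + self.t_wrapup)
        return sum(self.recent) / len(self.recent)
```

Averaging over a short window dampens the effect of a single unusually fast or slow reporting interval.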


Mantri estimates t_new, the distribution over time that a
new copy of the task will take to run, as follows:


    t_new = processRate * locationFactor * d + schedLag.    (3)


The first term is a distribution of the process rate, i.e.,
Δtime/Δdata, of all the tasks in this phase. The second term is
a relative factor that accounts for whether the candidate
machine for running this task is persistently slower (or
faster) than other machines or has smaller (or larger)
capacity on the network path to where the task’s inputs are
located. The third term, as before, is the amount of data
the task has to process. The last term is the average delay
between a task being scheduled and when it gets to run.
We show in §6.2 that these estimates of t_rem and t_new are
sufficiently accurate for Mantri’s functioning.
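
Because t_new is a distribution, one simple way to work with Equation (3) is to carry a list of samples, one per process rate observed in the phase. The sketch below assumes that representation and treats the process rate as time per unit of data; the function names are hypothetical.

```python
def t_new_samples(process_rates, location_factor, d, sched_lag):
    """One t_new sample per observed process rate (time per unit of data)."""
    return [rate * location_factor * d + sched_lag for rate in process_rates]


def prob_new_copy_faster(samples, t_rem):
    """Empirical estimate of P(t_new < t_rem)."""
    return sum(s < t_rem for s in samples) / len(samples)
```

For instance, with rates 0.25 and 0.5, a location factor of 2, 100 units of data and a scheduling lag of 4, the samples are 54 and 104, so a task with t_rem = 60 would be beaten by a new copy in half of the sampled cases.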





6 Evaluation 


We deployed and evaluated Mantri on Bing’s production 
cluster consisting of thousands of servers. Mantri has been 
running as the outlier mitigation module for all the jobs 
in Bing’s clusters since May 2010. To compare against a 
wider set of alternate techniques, we built a trace driven 
simulator that replays logs from production. 


6.1 Setup 


Clusters: The production cluster consists of thousands 
of server-class multi-core machines with tens of GBs of 
RAM that are spread roughly 40 servers to a rack. This 
cluster is used by Bing product groups. The data we an- 
alyzed earlier is from this cluster, so the observations 
from §4 hold here.


Workload: Mantri is the default outlier mitigation solu- 
tion for the production cluster. The jobs submitted to 
this cluster are independent of us, enabling us to evalu- 
ate Mantri’s performance in a live cluster across a variety of 
production jobs. We compare Mantri’s performance on all 
jobs in the month of June 2010 with prior runs of the same 
jobs in April-May 2010 that ran with the earlier build of 
Cosmos. 

In addition, we also evaluate Mantri on four hand- 
picked applications that represent common building 
blocks. Word Count calculates the number of unique 
words in the input. Table Join inner joins two tables each 
with three columns of data on one of the columns. Group 
By counts the number of occurrences of each word in the 
file. Finally, grep searches for string patterns in the input. 
We vary input sizes from 53 GB to 500 GB. 


Prototype: Mantri builds on the Cosmos job scheduler
and consists of about 1000 lines of C++ code. To
compute t_rem, Mantri maintains an execution record for each
of the running tasks that is updated when the task reports
progress. A phase-wide data structure stores the
necessary statistics to compute t_new. When slots become
available, Mantri runs Pseudocode 1 and restarts or duplicates
the task that would benefit the most or starts new tasks
in descending order of data size. To place tasks
appropriately, Mantri builds on the per-task affinity list, a preferred
set of machines and racks that the task can run on. At
run-time the job manager attempts to place the task at 
its preferred locations in random order, and when none 
of them are available runs the task at the first available 
slot. The affinity list for map tasks has machines that have 
replicas of the input blocks. For reduce tasks, to obtain 
the desired proportional spread across racks (see §5.2), 
we populate the affinity list with a proportional number 
of machines in those racks. 


Trace-driven Simulator: The simulator replays the logs 
shown in Table 1. For each phase, it faithfully repeats 
the observed distributions of task completion times, data 
read by each task, size and location of inputs, probability 
of failures and recomputations, and fairness based evic- 
tions. Restarted tasks have their execution times and fail- 
ure probabilities sampled from the same distribution of 
tasks in their phase. The simulator also mimics the job 
workflow including semantics like barriers before phases, 
the permissible concurrent slots per phase and the in- 
put/output relationships between phases. It mimics clus- 
ter characteristics like machine failures, network conges- 
tion and availability of computation slots. For the net- 
work, it uses a fluid model rather than simulating indi- 
vidual packets. Doing the latter, at petabyte scale, is out 
of scope for this work. 


Compared Schemes: Our results on the production
cluster use the current Dryad implementation as the
baseline (§6.2). It contains state-of-the-art outlier mitigation
strategies and runs thousands of jobs daily.


Our simulator performs a wider and detailed com- 
parison. It compares Mantri with the outlier mitigation 
strategies in Hadoop [1], Dryad [13], Map-Reduce [11], 
LATE [20], and a modified form of LATE that acts on 
stragglers early in the phase. As the current Dryad build 
already has modules for straggler mitigation, we
compare all of these schemes to a baseline that does not
mitigate any stragglers (§6.3). On the other hand, since these
schemes do not do network-aware placement or
recompute mitigation, we use the current Dryad
implementation itself as their baseline (§6.4 and §6.5).


We also compare Mantri against some ideal bench- 
marks. NoSkew mimics the case when all tasks in a phase 
take the same amount of time, set to the average over 
the observed task durations. NoSkew+ChopTail goes
even further: it removes the worst quartile of the observed
durations and sets every task to the average of the
remaining durations. IdealReduce assumes perfect up-to-date
Figure 12: Evaluation of Mantri as the default build for all jobs
on the production cluster for twenty-five days. 


knowledge of available bandwidths and places reduce 
tasks accordingly. IdealRecompute uses future knowledge 
of which tasks will have their inputs recomputed and pro- 
tects those inputs. 
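
The two skew-free benchmarks are simple transformations of the observed task durations. The sketch below assumes that "worst quartile" means the slowest quarter of tasks (rounded down) and that every task, including those in the removed quartile, is then assigned the average of the remaining durations; the function names are illustrative.

```python
def no_skew(durations):
    """Every task takes the average of the observed durations."""
    avg = sum(durations) / len(durations)
    return [avg] * len(durations)


def no_skew_chop_tail(durations):
    """Drop the slowest quartile, then give every task the average of the rest."""
    keep = len(durations) - len(durations) // 4
    kept = sorted(durations)[:keep]
    avg = sum(kept) / len(kept)
    return [avg] * len(durations)
```

For durations `[1, 2, 3, 10]`, NoSkew assigns 4 to every task, while NoSkew+ChopTail discards the 10 and assigns 2.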


Metrics: As our primary metrics, we use the reduction in
completion time and resource usage⁸, where


    Reduction = (Current − Modified) / Current.    (4)
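
As a quick worked example of this metric, expressed in percent as the results are reported (the function name is illustrative):

```python
def reduction_percent(current, modified):
    """Percentage reduction of a metric relative to the current value."""
    return 100 * (current - modified) / current
```

A job that drops from 100 to 50 units shows a reduction of 50, while one that grows from 100 to 120 units shows a reduction of -20, i.e., the modification made it worse.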


Summary: Our results are summarized as follows: 

• In live deployment in the production cluster, Mantri
  sped up the median job by 32%. 55% of the jobs
  experienced a net reduction in resources used. Further,
  Mantri’s network-aware placement reduced the
  completion times of typical reduce phases by 31%.

• Simulations driven from production logs show that
  Mantri’s restart strategy reduces the completion time of
  phases by 21% (and 42%) at the 50th (and 75th)
  percentile. Here, Mantri’s reduction in completion time
  improves on Hadoop by 3.1x while using fewer
  resources than Map-Reduce, each of which is the
  current best on those respective metrics.

• Mantri’s network-aware placement of tasks speeds up
  half of the reduce phases by at least 60% each.

• Mantri reduces the completion times due to
  recomputations of jobs that constitute 25% (or 50%) of the
  workload by at least 40% (or 20%) each while consuming
  negligible extra resources.


6.2 Deployment Results 


Jobs in the Wild: We compare one month of jobs in the 
Bing production cluster that ran after Mantri was turned 
live with runs of the same job(s) on earlier builds. We 
use only those recurring jobs that have roughly similar 
amounts of input and output across runs. Figure 12(a) 
plots the CDF of the improvement in completion time. 
The y axis weighs each job by the total time its tasks take
to run since improvement on larger jobs adds more value 


⁸A reduction of 50% implies that the property in question,
completion time or resources used, decreases by half. Negative values of
reduction imply that the modification uses more resources or takes
longer.
Figure 13: Comparing Mantri’s straggler mitigation with the
baseline implementation on a production cluster of thousands 
of servers for the four representative jobs. 


         % reduction in completion time
           avg     min     max
Phase     28.4    21.5    34.2
Job       12.6     7.0    19.2


Table 2: Comparing Mantri’s network-aware spread of tasks
with the baseline implementation on a production cluster of
thousands of servers.


to the cluster. Jobs that occupy the cluster for half the time 
sped up by at least 32.1%. Figure 12(b) shows that 55% of 
jobs see a reduction in resource consumption while the 
others use up a few extra resources. These gains are due 
to Mantri’s ability to detect outliers early and accurately. 
The success rate of Mantri’s copies, i.e., the fraction of time 
they finish before the original copy, improves by 2.8x over 
the earlier build. At the same time, Mantri expends fewer
resources: it starts .47x fewer copies. Further, Mantri acts
early: over 50% of its copies are started before the original
task has completed 42% of its work, as opposed to 77%
with the earlier build.


Straggler Mitigation: To cross-check the above results 
on standard jobs, we ran four prototypical jobs with and 
without Mantri twenty times each. Figure 13 shows that 
job completion times improve by roughly 25% and
resource usage falls by roughly 10%. The histograms plot
the average reduction; error bars are the 10th and 90th
percentiles of samples. Further, we logged all the progress
reports for these jobs. We find that Mantri’s predictor,
based on reports from the recent past, estimates t_rem to
within a 2.9% error of the actual completion time.


Placement of Tasks: To evaluate Mantri’s network-aware 
spreading of reduce tasks, we ran Group By, a job with a 
long-running reduce phase, ten times on the production 
cluster. Table 2 shows that the reduce phase’s completion 
time reduces by 28.4% on average causing the job to speed 
up by an average of 12.6%. To understand why, we
measure the spread of tasks, i.e., the ratio of the number of
concurrent reduce tasks to the number of racks they ran 
in. High spread implies that some racks have more tasks 
which interfere with each other while other racks are idle. 
Mantri’s spread is 1.5 compared to 5.5 for the earlier build. 
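
The spread metric is straightforward to compute from the rack of each concurrently running reduce task. A small sketch, with a hypothetical function name and made-up rack labels:

```python
def spread(task_racks):
    """Number of concurrent reduce tasks divided by the number of racks they occupy."""
    return len(task_racks) / len(set(task_racks))
```

Four tasks packed as three on one rack and one on another give a spread of 2, while one task per rack gives the minimum spread of 1.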

To compare against alternative schemes and to piece 
apart gains from the various algorithms in Mantri, we 
Figure 14: Comparing straggler mitigation strategies. Mantri
provides a greater speed-up in completion time while using 
fewer resources than existing schemes. 


present results from the trace-driven simulator. 


6.3 Can Mantri mitigate stragglers? 


Figure 14 compares straggler mitigation strategies in their 
impact on completion time and resource usage. The
y-axis weighs phases by their lifetime since improving the
longer phases improves cluster efficiency. The figures plot
the cumulative reduction in these metrics for the 210K 
phases in Table 1 with each repeated thrice. For this sec- 
tion, our common baseline is the scheduler that takes no 
action on outliers. Recall from §6.1 that the simulator re- 
plays the task durations and the anomalies observed in 
production. 

Figures 14(a) and 14(b) show that Mantri improves
completion time by 21% and 42% at the 50th and 75th
percentiles and reduces resource usage by 3% and 7% at
these percentiles. From Figure 14(a), at the 50th
percentile, Mantri sped up phases by an additional 3.1x over
the 6.9% improvement of Hadoop, the next best scheme.
To achieve the smaller improvement Hadoop uses 15.9%
more resources (Fig. 14(b)). Map-Reduce and Dryad
have no positive impact for 80% and 50% of the phases
respectively. Up to the 30th percentile Dryad increases
the completion time of phases. LATE is similar in its time
improvement to Hadoop but uses fewer resources.

The reason for the poor performance of the other schemes is
that they miss outliers that happen early in the phase and,
by not knowing the true causes of outliers, schedule duplicates
that are mostly not useful. Mantri and Dryad schedule .2 restarts
per task for the average phase (.06 and .56 for LATE and 
and slightly worse than NoSkew+ChopTail (see end of §6.3)
Figure 17: By being network aware, Mantri speeds up the
median reduce phase by 60% over the current placement.


Hadoop). But, Mantri’s restarts have a success rate of 70% 
compared to the 15% for LATE. The other schemes have 
lower success rates. 


While the insight of early action on stragglers is
valuable, it is nonetheless nontrivial. We evaluate this in
Figures 15(a) and 15(b) that present a form of LATE that
is identical in all ways except that it addresses stragglers
early. We see that addressing stragglers early increases
completion time up to the 40th percentile, uses more
resources and is worse than vanilla LATE. Being resource
aware is crucial to get the best out of early action (§5.1).


Finally, Fig. 16 shows that Mantri is on par with the ideal 
benchmark that has no variation in tasks, NoSkew, and is 
slightly worse than the variant that removes all durations 
in the top quartile, NoSkew+ChopTail. The reason is that 
Mantri’s ability to substitute long running tasks with their 
faster copies makes up for its inability to act with perfect 
future knowledge of which tasks straggle. 
6.4 Does Mantri improve placement?


Figure 17 plots the reduction in completion time due to 
Mantri’s placement of reduce tasks as a CDF over all
reduce phases in the dataset in Table 1. As before, the
y-axis weighs phases by their lifetime. The figure shows that
Mantri provides a median speed up of 59% or a 2.5x im- 
provement over the current implementation. 

The figure also compares Mantri against strategies that 
estimate available bandwidths differently. The IdealRe- 
duce strategy tracks perfectly the changes in available 
bandwidth of links due to the other jobs in the cluster. The 
Equal strategy assumes that the available bandwidths are 
equal across all links whereas Start assumes that the avail- 
able bandwidths are the same as at the start of the phase. 
We see a partial order between Start and Equal (the two 
solid lines). Short phases are impacted by transient dif- 
ferences in the available bandwidths and Start is a good 
choice for these phases. However, these differences even 
out over the lifetime of long phases, for which Equal works
better. Mantri is a hybrid of Start and Equal. It achieves a 
good approximation of IdealReduce without re-sampling 
available bandwidths. 

To capture how Mantri’s placement differs from Dryad, 
Figure 18 plots the ratio of the throughput obtained by the 
median task in each reduce phase to that obtained by the 
slowest task. With Mantri, this ratio is 1.05 at median and 
never larger than 2. In contrast, with Dryad’s policy of 
placing tasks at the first available slot, this ratio is 5.25 (or 
14.33) at the 50th (or 75th) percentile. Note that
duplicating tasks that are delayed due to network congestion
without considering the available bandwidths or where other
tasks are located would be wasteful.


6.5 Does Mantri help with recomputations? 


The best possible protection against loss of output would 
(a) eliminate all the increase in job completion time due 
to tasks waiting for their inputs to be recomputed and (b) 
do so with little additional cost. Mantri approximates both 
goals. Fig. 19 shows that Mantri achieves parity with Ideal- 
Recompute. Recall that IdealRecompute has perfect future 
knowledge of loss. The improvement in job completion 
time is 20% (40%) at the 50th (75th) percentile.

The reason is that Mantri’s policy of selective replica- 
tion is both accurate and biased towards the more expen- 
sive recomputations. The probability that task output that 
was replicated will be used because the original data be- 
comes unavailable is 84%. Similarly, the probability that a 
pre-computation becomes useful is 76%, which increases 
to 93% if pre-computations are triggered only when two 
recomputations happen at a machine in quick succes- 
sion. Figure 20 shows the complementary contributions 
from replication and pre-computation, which contribute
Figure 18: Unlike Dryad, Mantri’s placement provides more
consistent throughput to tasks in reduce phases. 
Figure 19: By probabilistically replicating task output and
recomputing lost data before it is needed Mantri speeds up 
jobs by an amount equal to the ideal case of no data loss. 
to Mantri’s recomputation mitigation strategy, along with
individual contributions from replication and pre-computation.


roughly 66% and 33% to the total. Cumulatively, the fig- 
ure shows that Mantri eliminates 78% of recomputations 
for the median job. We note that Mantri ignores 75% of 
the recomputations in the bottom quartile of jobs since 
their impact on job completion time is small. 

Fig. 21(a) shows that the extra network traffic due to 
replication is (overall negligible and) comparable to Ide- 
alReduce. Mantri sometimes replicates more data than the 
ideal, and at other times misses some tasks that should be 
replicated. Fig. 21(b) shows that pre-computations take
only a few percent extra resources.


7 Related Work 


Much recent work focuses on large scale data parallel 
computing. Following on the Map-Reduce [11] paper, 
there has been work in improving workflows [1, 13], lan- 
guage design [8, 27], and fair schedulers [18]. Our work 
Figure 21: The cost to protect against recomputes is less than
a few percentage points in both the extra traffic on the network 
and cluster time for pre-computation. 


here takes the next step of understanding how such pro- 
duction clusters behave and can be improved. 

Run-time stragglers have been identified by past 
work [11, 20]. However, we are the first to character- 
ize the prevalence of stragglers in production and their 
causes. By understanding the causes, addressing strag- 
glers early and scheduling duplicates only when there is 
a fair chance that the speculation saves both time and re- 
sources, our approach provides a greater reduction in job 
completion time while using fewer resources than prior 
strategies that duplicate tasks towards the end of a phase. 
Also, we uniquely avoid network hotspots and protect 
against loss of task output, two further causes of outliers. 

By only acting at the end of a phase, current schemes [1,
11, 13] miss early outliers. They vary in the choice of 
which tasks to duplicate. After a threshold number of 
tasks have finished, Map-Reduce [11] duplicates all the 
tasks that remain. Dryad [13] duplicates those that have 
been running for longer than the 75th percentile of task 
durations. After all tasks have started, Hadoop [1] uses 
slots that free up to duplicate any task that has read less 
data than the others, while LATE [20] duplicates only 
those reading at a slow rate. 

Though some recent proposals do away with capacity 
over-subscription in data centers [3, 17], today’s networks 
remain over-subscribed albeit at smaller levels. It is com- 
mon to place tasks near their input (same machine, rack 
etc.) for map and at the first free slot for reduce [1, 11, 13]. 
Our approach to eliminate outliers by a network-aware 
placement is orthogonal to recent work that packs tasks 
requiring different resources on to a machine [25], or 
trades off fairness with efficiency [18]. Quincy accounts
for capacity but not for runtime variations in bandwidth 
due to competition from other tasks. 

ISS [15] protects intermediate data by replicating 
locally-consumed data. In particular, this does not in- 
clude map output, since Hadoop transfers map output to 
reduce tasks as it is produced. ISS’s replication strategy 
runs the risk of being both wasteful (when very few ma- 
chines are error-prone) and insufficient (when the trans- 
fer of map output fails). In contrast, Mantri presents a 
broader solution that (a) replicates task output based on 
the probability of data loss and the recursive cost of re-
computing inputs and (b) pre-computes lost data.

The Map-Reduce paradigm is similar to parallel 
databases in its goal of analyzing large data [22] and to 
dedicated HPC clusters and parallel programs [16] by 
presenting similar optimization opportunities. In the 
context of multiple processors, studies have been done on 
the classic problem of dynamic task scheduling [4, 6] as 
well as task duplication [24]. Star-MPI [2] adapts param- 
eters like network topology between a set of communi- 
cating processors by observing performance over time. 
Prior work has also focused on modeling and optimiz-
ing the communication in parallel programs [10, 19, 21]
that have one-to-all or all-to-all traffic, i.e., where ev-
ery receiver processes all of the output of tasks in earlier
stages. In the context of the many-to-many traffic typical
of Map-Reduce, we present practical techniques for band-
width estimation and task placement that realize near-
optimal performance.


8 Conclusion 


Mantri delivers effective mitigation of outliers in Map- 
Reduce networks. It is motivated by what we believe is
the first study of a large production Map-Reduce clus- 
ter. The root of Mantri’s advantage lies in integrating 
static knowledge of job structure and dynamically avail- 
able progress reports into a unified framework that iden- 
tifies outliers early, applies cause-specific mitigation and 
does so only if the benefit is higher than the cost. In our 
implementation on a cluster of thousands of servers, we 
find Mantri to be highly effective. 

Outliers are an inevitable side-effect of parallelizing 
work. They hurt Map-Reduce networks more due to the 
structure of jobs as graphs of dependent phases that pass 
data from one to the other. Their many causes reflect the 
interplay between the network, storage, and computation
in Map-Reduce. Current systems shirk this complexity 
and assume that a duplicate would speed things up. Mantri 
embraces it to mitigate a broad set of outliers. 
Abstract


Distributed in-memory application data caches like mem- 
cached are a popular solution for scaling database-driven 
web sites. These systems are easy to add to existing de- 
ployments, and increase performance significantly by re- 
ducing load on both the database and application servers. 
Unfortunately, such caches do not integrate well with 
the database or the application. They cannot maintain 
transactional consistency across the entire system, vio- 
lating the isolation properties of the underlying database. 
They leave the application responsible for locating data 
in the cache and keeping it up to date, a frequent source 
of application complexity and programming errors. 

Addressing both of these problems, we introduce a 
transactional cache, TxCache, with a simple program- 
ming model. TxCache ensures that any data seen within 
a transaction, whether it comes from the cache or the 
database, reflects a slightly stale but consistent snap- 
shot of the database. TxCache makes it easy to add 
caching to an application by simply designating func- 
tions as cacheable; it automatically caches their results, 
and invalidates the cached data as the underlying database 
changes. Our experiments found that adding TxCache 
increased the throughput of a web application by up to 
5.2×, only slightly less than a non-transactional cache,
showing that consistency does not have to come at the 
price of performance. 


1 Overview 


Today’s web applications are used by millions of users 
and demand implementations that scale accordingly. A 
typical system includes application logic (often imple- 
mented in web servers) and an underlying database that 
stores persistent state, either of which can become a bot- 
tleneck [1]. Increasing database capacity is typically a 
difficult and costly proposition, requiring careful parti- 
tioning or the use of distributed databases. Application 
server bottlenecks can be easier to address by adding 
more nodes, but this also quickly becomes expensive. 
Application-level data caches, such as mem- 
cached [24], Velocity/AppFabric [34] and NCache [25], 
are a popular solution to server and database bottlenecks. 


They are deployed extensively by well-known web ap- 
plications like LiveJournal, Facebook, and MediaWiki. 
These caches store arbitrary application-generated data in 
a lightweight, distributed in-memory cache. This flexibil- 
ity allows an application-level cache to act as a database 
query cache, or to act as a web cache and cache entire 
web pages. But increasingly complex application logic 
and more personalized web content have made it more use-
ful to cache the result of application computations that
depend on database queries. Such caching is useful be- 
cause it averts costly post-processing of database records, 
such as converting them to an internal representation, or 
generating partial HTML output. It also allows common 
content to be cached separately from customized con- 
tent, so that it can be shared between users. For example, 
MediaWiki uses memcached to store items ranging from 
translations of interface messages to parse trees of wiki 
pages to the generated HTML for the site’s sidebar. 


Existing caches like memcached present two chal- 
lenges for developers, which we address in this paper. 
First, they do not ensure transactional consistency with 
the rest of the system state. That is, there is no way to 
ensure that accesses to the cache and the database re- 
turn values that reflect a view of the entire system at a 
single point in time. While the backing database goes 
to great lengths to ensure that all queries performed in a
transaction reflect a consistent view of the database, i.e. it 
can ensure serializable isolation, it is nearly impossible 
to maintain these consistency guarantees while using a 
cache that operates on application objects and has no 
notion of database transactions. The resulting anomalies 
can cause incorrect information to be exposed to the user, 
or require more complex application logic because the 
application must be able to cope with violated invariants. 


Second, they offer only a GET/PUT interface, plac-
ing full responsibility for explicitly managing the cache
on the application. Applications must assign names to
cached values, perform lookups, and keep the cache up 
to date. This has been a common source of programming 
errors in applications that use memcached. In particular, 
applications must explicitly invalidate cached data when 
the database changes. This is often difficult; identifying 
every cached application computation whose value may 
have been changed requires global reasoning about the
application. 

We address both problems in our transactional cache, 
TxCache. TxCache provides the following features: 


• transactional consistency: all data seen by the appli-
cation reflects a consistent snapshot of the database,
whether the data comes from cached application-
level objects or directly from database queries.

• access to slightly stale but nevertheless consistent
snapshots for applications that can tolerate stale data,
improving cache utilization.

• a simple programming model, where applications
simply designate functions as cacheable. The Tx-
Cache library then handles inserting the result of the
function into the cache, retrieving that result the next
time the function is called with the same arguments,
and invalidating cached results when they change.


To achieve these goals, TxCache introduces the follow- 
ing noteworthy mechanisms: 


• a protocol for ensuring that transactions see only
consistent cached data, using minor database modi-
fications to compute the validity times of database
queries, and attaching them to cache objects.

• a lazy timestamp selection algorithm that assigns a
transaction to a timestamp in the recent past based
on the availability of cached data.

• an automatic invalidation system that tracks each ob-
ject’s database dependencies using dual-granularity
invalidation tags, and produces notifications if they
change.


We ported the RUBiS auction website prototype and 
MediaWiki, a popular web application, to use TxCache, 
and evaluated it using the RUBiS benchmark [2]. Our 
cache improved peak throughput by 1.5–5.2× depend-
ing on the cache size and staleness limit, an improvement
only slightly below that of a non-transactional cache.

The next section presents the programming model and 
consistency semantics. Section 3 sketches the structure 
of the system, and Sections 4–6 describe each component
in detail. Section 7 describes our experiences porting ap- 
plications to TxCache, Section 8 presents a performance 
evaluation, and Section 9 reviews the related work. 


2 System and Programming Model 


TxCache is designed for systems consisting of one or 
more application servers that interact with a database 
server. These application servers could be web servers 
running embedded scripts (e.g. with mod_php), or dedi- 
cated application servers, as with Sun’s Enterprise Java 
Beans. The database server is a standard relational 
database; for simplicity, we assume the application uses 
a single database to store all of its persistent state. 
TxCache introduces two new components, as shown in
Figure 1: a cache and an application-side cache library,
as well as some minor modifications to the database
server. The cache is partitioned across a set of cache
nodes, which may run on dedicated hardware or share
it with other servers. The application never interacts
with the cache servers; the TxCache library transparently
translates an application’s cacheable functions into cache
accesses.

Figure 1: Key components in a TxCache deployment.
The system consists of a single database, a set of cache
nodes, and a set of application servers. TxCache also
introduces an application library, which handles all inter-
actions with the cache server.


2.1 Programming Model 


Our goal is to make it easy to incorporate caching into a 
new or existing application. Towards this end, TxCache 
provides an application library with a simple program- 
ming model, shown in Figure 2, based on cacheable func-
tions. Application developers can cache computations
simply by designating functions to be cached.

Programs group their operations into transactions. Tx- 
Cache requires applications to specify whether their trans- 
actions are read-only or read/write by using either the 
BEGIN-RO or BEGIN-RW function. Transactions are 
ended by calling COMMIT or ABORT. Within a transac- 
tion block, TxCache ensures that, regardless of whether 
the application gets its data from the database or the 
cache, it sees a view consistent with the state of the 
database at a single point in time. 

• BEGIN-RO(staleness) : Begin a read-only transac-
tion. The transaction sees a consistent snapshot
from within the past staleness seconds.

• BEGIN-RW() : Begin a read/write transaction.

• COMMIT() → timestamp : Commit a transaction and
return the timestamp at which it ran.

• ABORT() : Abort a transaction.

• MAKE-CACHEABLE(fn) → cached-fn : Makes a
function cacheable. cached-fn is a new function
that first checks the cache for the result of an-
other call with the same arguments. If not found,
it executes fn and stores its result in the cache.

Figure 2: TxCache library API

Within a transaction, operations can be grouped into
cacheable functions. These are actual functions in the pro-
gram’s code, annotated to indicate that their results can
be cached. A cacheable function can consist of database
queries and computation, and can also make calls to other
cacheable functions. To be suitable for caching, functions
must be pure, i.e. they must be deterministic, not have
side effects, and depend only on their arguments and the 
database state. For example, it would not make sense to 
cache a function that returns the current time. TxCache 
currently relies upon programmers to ensure that they 
only cache suitable functions, but this requirement could 
also be enforced using static or dynamic analysis [14, 33]. 

Cacheable functions are essentially memoized. Tx- 
Cache’s library provides a MAKE-CACHEABLE function 
that takes an implementation of a cacheable function and 
returns a wrapper function that can be called to take ad- 
vantage of the cache. When called, the wrapper function 
checks if the cache contains the result of a previous call 
to the function with the same arguments that is consistent 
with the current transaction’s snapshot. If so, it returns 
it. Otherwise, it invokes the implementation function 
and stores the returned value in the cache. With proper 
linguistic support (e.g. Python decorators), marking a 
function cacheable can be as simple as adding a tag to its 
existing definition. 

Our cacheable function interface is easier to use than 
the GET/PUT interface provided by existing caches like 
memcached. It does not require programmers to manually 
assign keys to cached values and keep them up to date. 
Although seemingly straightforward, this is nevertheless 
a source of errors because selecting keys requires reason- 
ing about the entire application and how the application 
might evolve. Examining MediaWiki bug reports, we 
found that several memcached-related MediaWiki bugs 
stemmed from choosing insufficiently descriptive keys, 
causing two different objects to overwrite each other [22]. 
In one case, a user’s watchlist page was always cached 
under the same key, causing the same results to be re- 
turned even if the user requested to display a different 
number of days worth of changes. 

TxCache’s programming model has another crucial 
benefit: it does not require applications to explicitly up- 
date or invalidate cached results when modifying the 


database. Adding explicit invalidations requires global 
reasoning about the application, hindering modularity: 
adding caching for an object requires knowing every 
place it could possibly change. This, too, has been a 
source of bugs in MediaWiki [23]. For example, edit- 
ing a wiki page clearly requires invalidating any cached 
copies of that page. But other, less obvious objects must 
be invalidated too. Once MediaWiki began storing each 
user’s edit count in their cached USER object, it became 
necessary to invalidate this object after an edit. This was 
initially forgotten, indicating that identifying all cached 
objects needing invalidation is not straightforward, espe- 
cially in applications so complex that no single developer 
is aware of the whole of the application. 


2.2 Consistency Model 


TxCache provides transactional consistency: all requests 
within a transaction see a consistent view of the system 
as of a specific timestamp. That is, requests see only 
the effects of other transactions that committed prior to 
that timestamp. For read/write transactions, TxCache 
supports this guarantee by running them directly on the 
database, bypassing the cache entirely. Read-only trans- 
actions use objects in the cache, and TxCache ensures 
that nevertheless they view a consistent state. 

Most caches return slightly stale data simply because 
modified data does not reach the cache immediately. Tx- 
Cache goes further by allowing applications to specify an 
explicit staleness limit to BEGIN-RO, indicating that
the transaction can see a view of data from that time or
later. However, regardless of the age of the snapshot, each 
transaction always sees a consistent view. This feature 
is motivated by the observation that many applications 
can tolerate a certain amount of staleness [18], and using 
stale cached data can improve the cache’s hit rate [21]. 

Applications can specify their staleness limit on a per- 
transaction basis. Additionally, when a transaction com- 
mits, TxCache provides the user with the timestamp at 
which it ran. Together, these can be used to avoid anoma- 
lies. For example, an application can store the timestamp 
of a user’s last transaction in its session state, and use that 
as a staleness bound so that the user never observes time
moving backwards. More generally, these timestamps 
can be used to ensure a causal ordering between related 
transactions [20]. 

We chose to have read/write transactions bypass the 
cache entirely so that TxCache does not introduce new 
anomalies. The application can expect the same guaran- 
tees (and anomalies) of the underlying database. For ex- 
ample, if the underlying database uses snapshot isolation, 
the system will still have the same anomalies as snap- 
shot isolation, but TxCache will never introduce snapshot 
isolation anomalies into the read/write transactions of a 
system that does not use snapshot isolation. Our model 
could be extended to allow read/write transactions to read
information from the cache, if applications are willing 
to accept the risk of anomalies. One particular challenge 
is that read/write transactions typically expect to see the 
effects of their own updates, while these cannot be made 
visible to other transactions until the commit point. 


3 System Architecture 


In order to present an easy-to-use interface to application 
developers, TxCache needs to store cached data, keep it 
up to date, and ensure that data seen by an application is 
transactionally consistent. This section and the following 
ones describe how it achieves this using cache servers, 
modifications to the database, and an application-side 
library. None of this complexity, however, is visible to 
the application, which sees only cacheable functions.

An application running with TxCache accesses infor- 
mation from the cache whenever possible, and from the 
database on a cache miss. To ensure it sees a consistent 
view, TxCache uses versioning. Each database query 
has an associated validity interval, describing the range 
of time over which its result was valid, which is com- 
puted automatically by the database. The TxCache li- 
brary tracks the queries that a cached value depends on, 
and uses them to tag the cache entry with a validity inter- 
val. Then, the library provides consistency by ensuring 
that, within each read-only transaction, it only retrieves 
values from the cache and database that were valid at 
the same time. Thus, each transaction effectively sees a 
snapshot of the database taken at a particular time, even 
as it accesses data from the cache. 
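
One way to picture the "valid at the same time" rule is to have each read-only transaction carry the range of snapshot timestamps it could still be running at, and to narrow that range with every value it reads. The following is a minimal sketch under assumptions that are not taken from the paper: timestamps are integers, validity intervals are half-open `[valid_from, valid_until)`, and the class and method names are hypothetical.

```python
import math


class ReadOnlyTransaction:
    def __init__(self, now, staleness):
        # Any timestamp within the staleness limit is initially acceptable.
        self.lo, self.hi = now - staleness, now

    def try_use(self, valid_from, valid_until=math.inf):
        """Accept a value valid over [valid_from, valid_until) only if that
        interval overlaps the timestamps still acceptable to this transaction.
        On success, narrow the acceptable range to the overlap."""
        lo = max(self.lo, valid_from)
        hi = min(self.hi, valid_until - 1)   # last integer timestamp covered
        if lo > hi:
            return False    # not valid alongside data already seen: a miss
        self.lo, self.hi = lo, hi
        return True
```

Every accepted value was valid at each timestamp remaining in `[lo, hi]`, so the transaction as a whole corresponds to a single snapshot taken somewhere in that range.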

Section 4 describes how the cache is structured, and de- 
fines how a cached object’s validity interval and database 
dependencies are represented. Section 5 describes how 
the database is modified to track query validity intervals 
and provide invalidation notifications when a query’s re- 
sult changes. Section 6 describes how the library tracks 
dependencies for application objects, and selects consis- 
tent values from the cache and database. 


4 Cache Design 


TxCache stores cached data in RAM on a number of 
cache servers. The cache presents a hash table interface: 
it maps keys to associated values. Applications do not 
interact with the cache directly; the TxCache library trans- 
lates the name and arguments of a function call into a 
hash key, and checks and updates the cache itself. 

Data is partitioned among cache nodes using a consis- 
tent hashing approach [17], as in peer-to-peer distributed 
hash tables [31, 35]. Unlike DHTs, we assume that the 
system is small enough that every application node can 
maintain a complete list of cache servers, allowing it to 
immediately map a key to the responsible node. This 
list could be maintained by hand in small systems, or
using a group membership service [10] in larger or more
dynamic environments.

[Figure 3 graphic: versions of each key drawn as rectangles along a timestamp axis marked 45, 50, 55.]

Figure 3: An example of versioned data in the cache at
one point in time. Each rectangle is a version of a data
item. For example, the data for key 1 became valid with
commit 51 and invalid with commit 53, and the data for
key 2 became valid with commit 46 and is still valid.
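
The key-to-node mapping described in this section can be sketched as a small consistent-hashing ring. This is an illustration rather than TxCache's implementation: the use of SHA-1, the number of ring points per node, and the names `HashRing`, `node_for`, and `cache_key` are all assumptions.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # Use the first 8 bytes of a SHA-1 digest as a position on the ring.
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")


def cache_key(fn_name, args):
    # Hypothetical key format: function name plus a repr of its arguments.
    return f"{fn_name}:{args!r}"


class HashRing:
    def __init__(self, nodes, points_per_node=64):
        # Each node owns several points on the ring to even out the load.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(points_per_node)
        )
        self._hashes = [h for h, _ in self._points]

    def node_for(self, key: str) -> str:
        # The responsible node owns the first point clockwise from the key.
        idx = bisect.bisect_right(self._hashes, _hash(key)) % len(self._points)
        return self._points[idx][1]
```

Since every application node holds the full server list, a lookup is a local computation. When a node leaves, only the keys it owned move to other nodes.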


4.1 Versioning 


Unlike a simple hash table, our cache is versioned. In 
addition to its key, each entry in the cache is tagged with 
its validity interval, as shown in Figure 3. This interval is 
the range of time at which the cached value was current. 
Its lower bound is the commit time of the transaction 
that caused it to become valid, and its upper bound is the 
commit time of the first subsequent transaction to change 
the result, making the cache entry invalid. The cache 
can store multiple cache entries with the same key; they 
will have disjoint validity intervals because only one is 
valid at any time. Whenever the TxCache library puts 
the result of a cacheable function call into the cache, it 
includes the validity interval of that result (derived using 
information obtained from the database). 

To look up a result in the cache, the TxCache library 
sends both the key it is interested in and a timestamp 
or range of acceptable timestamps. The cache server re- 
turns a value consistent with the library’s request, i.e. one 
whose validity interval intersects the given range of ac- 
ceptable timestamps, if any exists. The server also returns 
the value’s associated validity interval. If multiple such 
values exist, the cache server returns the most recent one. 
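
The versioned lookup can be sketched as follows, assuming half-open validity intervals `[lower, upper)` with an infinite upper bound for values that are still valid. The class and method names are hypothetical, and the earlier version of key 1 in the usage note is invented for illustration.

```python
import math


class VersionedCache:
    def __init__(self):
        self._entries = {}   # key -> list of (lower, upper, value)

    def put(self, key, value, lower, upper=math.inf):
        # The validity interval is [lower, upper); math.inf means still valid.
        self._entries.setdefault(key, []).append((lower, upper, value))

    def lookup(self, key, ts_min, ts_max):
        """Return (value, (lower, upper)) for the most recent version whose
        validity interval intersects [ts_min, ts_max], or None on a miss."""
        hits = [
            e for e in self._entries.get(key, [])
            if e[0] <= ts_max and ts_min < e[1]
        ]
        if not hits:
            return None
        lower, upper, value = max(hits, key=lambda e: e[0])
        return value, (lower, upper)
```

With the data of Figure 3 plus a hypothetical earlier version of key 1 valid over `[45, 51)`, a lookup for key 1 with acceptable timestamps 48 to 52 matches both versions and returns the one that became valid at commit 51.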

When a cache node runs out of memory, it evicts old 
cached values to free up space for new ones. Cache 
entries are never pinned and can always be discarded; if 
one is later needed, it is simply a cache miss. A cache 
eviction policy can take into account both the time since 
an entry was accessed, and its staleness. Our cache server 
uses a least-recently-used replacement policy, but also 
eagerly removes any data too stale to be useful. 


4.2 Invalidation Tags and Streams 


When an object is inserted into the cache, it can be flagged 
as still-valid if it reflects the latest state of the database, 
like Key 2 in Figure 3. For such objects, the database 
provides invalidation notifications when they change.

Every still-valid object has an associated set of inval- 
idation tags that describe which parts of the database 
it depends on. Each invalidation tag has two parts: a 
table name and an optional index key description. The 
database identifies the invalidation tags for a query based 
on the access methods used to access the database. A 
query that uses an index equality lookup receives a two- 
part tag, e.g. a search for users with name Alice would 
receive tag USERS:NAME=ALICE. A query that performs 
a sequential scan or index range scan has a wildcard for 
the second part of the tag, e.g. USERS:*. Wildcard invali-
dations are expected to be very rare because applications
typically try to perform only index lookups; they exist 
primarily for completeness. Queries that access multiple 
tables or multiple keys in a table receive multiple tags. 
The object’s final tag set will have one or more tags for 
each query that the object depends on. 

The database distributes invalidations to the cache as 
an invalidation stream. This is an ordered sequence of 
messages, one for each update transaction, containing the 
transaction’s timestamp and all invalidation tags that it 
affected. Each message is delivered to all cache nodes by 
a reliable application-level multicast mechanism [10], or 
by link-level broadcast if possible. The cache servers pro- 
cess the messages in order, truncating the validity interval 
for any affected object at the transaction’s timestamp. 

Using the same transaction timestamps to order cache 
entries and invalidations eliminates race conditions that 
could occur if an invalidation reaches the cache server 
before an item is inserted with the old value. These race 
conditions are a real concern: MediaWiki does not cache 
failed article lookups, because a negative result might 
never be removed from the cache if the report of failure 
is stale but arrived after its corresponding invalidation. 

For cache lookup purposes, items that are still valid are 
treated as though they have an upper validity bound equal 
to the timestamp of the last invalidation received prior to 
the lookup. This ensures that there is no race condition 
between an item being changed on the database and in- 
validated in the cache, and that multiple items modified 
by the same transaction are invalidated atomically. 
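
The handling of the invalidation stream on a cache node can be sketched as below. Several details are assumptions made for illustration and are not taken from the paper: one entry per key, tags written as `TABLE:KEY` strings with `*` as the wildcard, a wildcard matching any key of the same table, an `as_of` argument giving the timestamp at which the database reported the value as still valid, and an unbounded in-memory history of past messages.

```python
import math


def _matches(tag_a, tag_b):
    # Tags match when they name the same table and the same key, or when
    # either side uses the wildcard key.
    table_a, _, key_a = tag_a.partition(":")
    table_b, _, key_b = tag_b.partition(":")
    return table_a == table_b and (key_a == key_b or "*" in (key_a, key_b))


class CacheNode:
    def __init__(self):
        self.entries = {}            # key -> entry dict
        self.history = []            # (timestamp, tags) in arrival order
        self.last_invalidation = 0

    def insert(self, key, value, lower, tags, as_of):
        entry = {"value": value, "lower": lower, "upper": math.inf, "tags": set(tags)}
        # Replay invalidations that arrived before this insert but are newer
        # than the snapshot the value was computed on.
        for ts, inv_tags in self.history:
            if ts > as_of:
                self._truncate(entry, ts, inv_tags)
        self.entries[key] = entry

    def apply_invalidation(self, timestamp, tags):
        # Messages are assumed to arrive in timestamp order.
        self.history.append((timestamp, set(tags)))
        for entry in self.entries.values():
            self._truncate(entry, timestamp, tags)
        self.last_invalidation = timestamp

    @staticmethod
    def _truncate(entry, timestamp, tags):
        still_valid = entry["upper"] == math.inf
        if still_valid and any(_matches(t, u) for t in entry["tags"] for u in tags):
            entry["upper"] = timestamp   # validity ends at the update's commit

    def usable_at(self, key, ts):
        e = self.entries[key]
        if e["upper"] == math.inf:
            # Still-valid entries are only known valid up to the last message.
            return e["lower"] <= ts <= self.last_invalidation
        return e["lower"] <= ts < e["upper"]
```

Replaying the history on insert is one way to make a late-arriving insert of an old value behave as if it had arrived before its invalidation. A real node would presumably bound that history.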


5 Database Support 


The validity intervals that TxCache uses in its cache 
are derived from validity information generated by the 
database. To make this possible, TxCache uses a modi- 
fied DBMS that has similar versioning properties to the 
cache. Specifically, it can run queries on slightly stale 
snapshots, and it computes validity intervals for each 
query result it returns. It also assigns invalidation tags to 
queries, and produces the invalidation stream described 
in Section 4.2. 

Though standard databases do not provide these fea- 


tures, we show they can be implemented by reusing the 
same mechanisms that are used to implement multiver- 
sion concurrency control techniques like snapshot isola- 
tion. In this section, we describe how we modified an ex- 
isting DBMS, PostgreSQL [29], to provide the necessary 
support. The modifications are not extensive (under 2000 
lines of code in our implementation). Moreover, they 
are not Postgres-specific; the approach can be applied to 
other databases that use multiversion concurrency. 


5.1 Exposing Multiversion Concurrency 


Because our cache allows read-only transactions to run 
slightly in the past, the database must be able to perform 
queries against a past snapshot of a database. This sit- 
uation arises when a read-only transaction is assigned 
a timestamp in the past and reads some cached data, 
and then a later operation in the same transaction results 
in a cache miss, requiring the application to query the 
database. The database query must return results consis- 
tent with the cached values already seen, so the query 
must execute at the same timestamp in the past. 


Temporal databases, which track the history of their 
data and allow “time travel,” solve this problem but im- 
pose substantial storage and indexing cost to support 
complex queries over the entire history of the database. 
What we require is much simpler: we only need to run a 
transaction on a stale but recent snapshot. Our insight is 
that these requirements are essentially identical to those 
for supporting snapshot isolation [5], so many databases 
already have the infrastructure to support them. 


We modified Postgres to expose the multiversion stor- 
age it uses internally to provide snapshot isolation. We 
added a PIN command that assigns an ID to a read-only 
transaction’s snapshot. When starting a new transaction, 
the TxCache library can specify this ID using the new 
BEGIN SNAPSHOTID syntax, creating a new transaction 
that sees the same view of the database as the erstwhile 
read-only transaction. The database state for that snap- 
shot will be retained at least until it is released by the 
UNPIN command. A pinned snapshot is identified by the 
commit time of the last committed transaction visible to 
it, allowing it to be easily ordered with respect to update 
transactions and other snapshots. 


Postgres is especially well-suited to this modifica- 
tion because of its “no-overwrite” storage manager [36], 
which already retains recent versions of data. Because 
stale data is only removed periodically by an asyn-
chronous “vacuum cleaner” process, the fact that we keep
data around slightly longer has little impact on perfor- 
mance. However, our technique is not Postgres-specific; 
any database that implements snapshot isolation must 
have a way to keep a similar history of recent database 
states, such as Oracle’s rollback segments. 
[Figure 4 graphic: validity intervals of tuples 1–4 plotted against commits 43–49, with the query timestamp, result validity, invalidity mask, and final validity interval marked.]

Figure 4: Example of tracking the validity interval for a
read-only query. All four tuples match the query predi-
cate. Tuples 1 and 2 match the timestamp, so their inter-
vals intersect to form the result validity. Tuples 3 and 4
fail the visibility test, so their intervals join to form the in-
validity mask. The final validity interval is the difference
between the result validity and the invalidity mask.


5.2 Tracking Result Validity 


TxCache needs the database server to provide the va- 
lidity interval for every query result in order to ensure 
transactional consistency of cached objects. Recall that 
this is defined as the range of timestamps for which the 
query would give the same results. Its lower bound is the 
commit time of the most recent transaction that added, 
deleted, or modified any tuple in the result set. It may 
have an upper bound if a subsequent transaction changed 
the result, or it may be unbounded if the result is still 
current. 

The validity interval is computed from two ranges, the
result tuple validity and the invalidity mask, which we
track separately.

The result tuple validity is the intersection of the valid- 
ity times of the tuples returned by the query. For example, 
tuple 1 in Figure 4 was deleted at time 47, and tuple 2 
was created at time 44; the result would be different be- 
fore time 44 or after time 47. This interval is easy to 
compute because multiversion concurrency requires that 
each tuple in the database be tagged with the ID of its 
creating transaction and deleting transaction (if any). We 
simply propagate these tags throughout query execution. 
If an operator, such as a join, combines multiple tuples to 
produce a single result, the validity interval of the output 
tuple is the intersection of its inputs. 

The result tuple validity, however, does not completely 
capture the validity of a query, because of phantoms. 
These are tuples that did not appear in the result, but 
would have if the query were run at a different timestamp. 
For example, tuple 3 in Figure 4 will not appear in the 
results because it was deleted before the query timestamp, 
but the results would be different if the query were run 
before it was deleted. Similarly, tuple 4 is not visible 
because it was created afterwards. We capture this effect 
with the invalidity mask, which is the union of the va- 
lidity times for all tuples that failed the visibility check, 
i.e. were discarded because their timestamps made them 
invisible to the transaction’s snapshot. Throughout query 
execution, whenever such a tuple is encountered, its va- 
lidity interval is added to the invalidity mask. 

The invalidity mask is conservative because visibility 
checks are performed as early as possible in the query 
plan to avoid processing unnecessary tuples. Some of 
these tuples might have been discarded anyway if they 
failed the query conditions later in the query plan (per- 
haps after joining with another table). While being con- 
servative preserves the correctness of the cached results, 
it might unnecessarily constrain the validity intervals of 
cached items, reducing the hit rate. To ameliorate this 
problem, we continue to perform the visibility check as 
early as possible, but during sequential scans and index 
lookups, we evaluate the predicate before the visibility 
check. This differs from regular Postgres with respect to 
sequential scans, where it evaluates the cheaper visibility 
check first. Delaying the visibility checks improves the 
quality of the invalidity mask, and incurs little overhead 
for simple predicates, which are most common. 

Finally, the invalidity mask is subtracted from the re- 
sult tuple validity to give the query’s final validity in- 
terval. This interval is reported to the TxCache library, 
piggybacked on each SELECT query result; the library 
combines these intervals to obtain validity intervals for 
objects it stores in the cache. 
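
The computation above can be sketched in a few lines. This is a minimal illustration, not the modified PostgreSQL code: it assumes half-open intervals [start, end) over commit timestamps, and the function names, the creation time of tuple 1, the intervals of tuples 3 and 4, and the query timestamp 46 are all hypothetical. Only the deletion of tuple 1 at time 47 and the creation of tuple 2 at time 44 come from the example in the text.

```python
import math
from typing import Iterable, Tuple

# Half-open interval [start, end) over commit timestamps; end may be math.inf.
Interval = Tuple[float, float]


def result_tuple_validity(visible: Iterable[Interval]) -> Interval:
    """Intersect the validity intervals of the tuples the query returned."""
    lo, hi = -math.inf, math.inf
    for start, end in visible:
        lo, hi = max(lo, start), min(hi, end)
    return lo, hi


def final_validity(result: Interval, mask: Iterable[Interval], ts: float) -> Interval:
    """Subtract the invalidity mask, keeping the piece that contains ts.

    Every masked interval belongs to a tuple that was invisible at ts, so it
    lies entirely before ts or entirely after it.
    """
    lo, hi = result
    for start, end in mask:
        if end <= ts:        # tuple was deleted at or before the query timestamp
            lo = max(lo, end)
        elif start > ts:     # tuple was created after the query timestamp
            hi = min(hi, start)
    return lo, hi


visible = [(40, 47), (44, math.inf)]      # tuples 1 and 2 (40 is assumed)
invisible = [(41, 45), (48, math.inf)]    # tuples 3 and 4 (assumed values)
interval = final_validity(result_tuple_validity(visible), invisible, ts=46)
print(interval)  # (45, 47)
```

Because the mask only ever trims the result tuple validity from one side or the other of the query timestamp, the final interval stays contiguous.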


5.3 Automating Invalidations 


When the database executes a query and reports that its 
validity interval is unbounded, i.e. the query result is still 
valid, it assumes responsibility for providing an invalida- 
tion when the result may have changed. At query time, 
it must assign invalidation tags to indicate the query’s 
dependencies, and at update time, it must notify the cache 
of invalidation tags for objects that might have changed. 

When a query is performed, the database examines the 
query plan it generates. At the lowest level of the tree are 
the access methods that obtain the data, e.g. a sequential 
scan of a heap file, or a B-tree index lookup. For index 
equality lookups, the database assigns an invalidation tag 
of the form TABLE:KEY. For other types, it assigns a 
wildcard tag TABLE:*. Each query may have multiple 
tags; the complete set is returned along with the SELECT 
query results. 

When a read/write transaction modifies some tuples, 
the database identifies the set of invalidation tags affected. 
Each tuple added, deleted, or modified yields one
invalidation tag for each index it is listed in. If a transaction
modifies most of a table, the database can aggregate multiple
tags into a single wildcard tag on TABLE:*. Generated 
invalidation tags are queued until the transaction commits. 
When it does, the database server passes the set of tags, 
along with the transaction’s timestamp, to the multicast 
service for distribution to the cache nodes, ensuring that 
the invalidation stream is properly ordered. 
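
As a rough sketch of the tagging scheme, assuming tags are plain strings of the form TABLE:KEY: the function names, the "index_eq" access kind, the key format such as "id=5", the aggregation threshold, and the matching rule in is_invalidated are all assumptions made for illustration and are not taken from the implementation.

```python
from typing import Dict, Iterable, List, Set, Tuple


def tags_for_query(accesses: Iterable[Tuple[str, str, str]]) -> Set[str]:
    """accesses: (table, access_kind, key) for each leaf of the query plan."""
    tags = set()
    for table, kind, key in accesses:
        # Index equality lookups get a precise tag; anything else a wildcard.
        tags.add(f"{table}:{key}" if kind == "index_eq" else f"{table}:*")
    return tags


def tags_for_update(changes: Iterable[Tuple[str, List[str]]],
                    wildcard_threshold: int = 100) -> Set[str]:
    """changes: (table, index_keys) for each tuple added, deleted, or modified."""
    per_table: Dict[str, Set[str]] = {}
    for table, index_keys in changes:
        per_table.setdefault(table, set()).update(f"{table}:{k}" for k in index_keys)
    tags: Set[str] = set()
    for table, table_tags in per_table.items():
        # Assumed heuristic: collapse a large batch into one wildcard tag.
        tags |= {f"{table}:*"} if len(table_tags) > wildcard_threshold else table_tags
    return tags


def is_invalidated(object_tags: Set[str], update_tags: Set[str]) -> bool:
    """Assumed rule: same table, and equal keys or a wildcard on either side."""
    for tag in object_tags:
        table, key = tag.split(":", 1)
        for update in update_tags:
            u_table, u_key = update.split(":", 1)
            if table == u_table and (key == u_key or "*" in (key, u_key)):
                return True
    return False


query_tags = tags_for_query([("items", "index_eq", "id=5"), ("users", "seq_scan", "")])
update_tags = tags_for_update([("items", ["id=5", "seller=9"])])
print(is_invalidated(query_tags, update_tags))  # True
```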


5.4 Pincushion 


TxCache needs to keep track of which snapshots are 
pinned on the database, and which of those are within 
a read-only transaction’s staleness limit. It also must 
eventually unpin old snapshots, provided that they are 
not used by running transactions. The DBMS itself could 
be responsible for tracking this information. However, to 
simplify implementation, and to reduce the overall load 
on the database, we placed this functionality instead in a 
lightweight daemon known as the pincushion (so named 
because it holds the pinned snapshot IDs). It can be run 
on the database host, on a cache server, or elsewhere. 

The pincushion maintains a table of currently pinned 
snapshots, containing the snapshot’s ID, the correspond- 
ing wall-clock timestamp, and the number of running 
transactions that might be using it. When the TxCache 
library running on an application node begins a read-only 
transaction, it requests from the pincushion all sufficiently 
fresh pinned snapshots, e.g. those pinned in the last 30 
seconds. The pincushion flags these snapshots as possibly 
in use, for the duration of the transaction. If there are no 
sufficiently fresh pinned snapshots, the TxCache library 
starts a read-only transaction on the database, running on 
the latest snapshot, and pins that snapshot. It then regis- 
ters the snapshot’s ID and the wall-clock time (as reported 
by the database) with the pincushion. The pincushion 
also periodically scans its list of pinned snapshots, re- 
moving any unused snapshots older than a threshold by 
sending an UNPIN command to the database. 
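
The bookkeeping described above might look roughly like the following. This is only a sketch: the class and method names are hypothetical, the unpin callback stands in for sending an UNPIN command to the database, and counting the registering transaction as one user of a new snapshot is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Pin:
    wall_clock: float   # pin time in seconds, as reported by the database
    in_use: int = 0     # running transactions that might be using the snapshot


class Pincushion:
    def __init__(self, unpin: Callable[[int], None]):
        self.pins: Dict[int, Pin] = {}
        self.unpin = unpin   # stand-in for sending UNPIN to the database

    def register(self, snapshot_id: int, wall_clock: float) -> None:
        """Record a newly pinned snapshot; its creator is assumed to be using it."""
        self.pins[snapshot_id] = Pin(wall_clock, in_use=1)

    def fresh_snapshots(self, now: float, staleness: float) -> List[int]:
        """Return sufficiently fresh snapshots and flag them as possibly in use."""
        ids = [sid for sid, p in self.pins.items() if now - p.wall_clock <= staleness]
        for sid in ids:
            self.pins[sid].in_use += 1
        return ids

    def release(self, snapshot_ids: List[int]) -> None:
        """Called when a transaction ends."""
        for sid in snapshot_ids:
            if sid in self.pins:
                self.pins[sid].in_use -= 1

    def sweep(self, now: float, max_age: float) -> None:
        """Periodic scan: unpin unused snapshots older than the threshold."""
        for sid, pin in list(self.pins.items()):
            if pin.in_use == 0 and now - pin.wall_clock > max_age:
                self.unpin(sid)
                del self.pins[sid]
```

A transaction would call fresh_snapshots when it begins and release when it ends, while sweep runs on a timer.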

Though the pincushion is accessed on every transac- 
tion, it performs little computation and is unlikely to form 
a bottleneck. In all of our experiments, nearly all pin- 
cushion requests received a response in under 0.2 ms, 
approximately the network round-trip time. We have also 
developed a protocol for replicating the pincushion to in- 
crease its throughput, but it has yet to become necessary. 


6 Cache Library 


Applications interact with TxCache through its 
application-side library, which keeps them blissfully 
unaware of the details of cache servers, validity intervals, 
invalidation tags and the like. It is responsible for as- 
signing timestamps to read-only transactions, retrieving 
values from the cache when cacheable functions are 
called, storing results in the cache, and computing the 
validity intervals and invalidation tags for anything it 
stores in the cache. 

In this section, we describe the implementation of the 
TxCache library. For clarity, we begin with a simplified 
version where timestamps are chosen when a transac- 
tion begins and cacheable functions do not call other 
cacheable functions. In Section 6.2, we describe a tech- 
nique for choosing timestamps lazily to take better advan- 
tage of cached data. In Section 6.3, we lift the restriction 
on nested calls. 


6.1 Basic Functionality 


The TxCache library is divided into a language- 
independent library that implements the core functional- 
ity, and a set of bindings that implement language-specific 
interfaces. Currently, we have only implemented bind- 
ings for PHP, but adding support for other languages 
should be relatively straightforward. 

Recall from Figure 2 that the library’s interface is 
simple: it provides the standard transaction commands 
(BEGIN, COMMIT, and ABORT), and functions are desig- 
nated as cacheable using a MAKE-CACHEABLE function 
that takes a function and returns a wrapped function that 
first checks for available cached values.¹ 

When a transaction is started, the application specifies 
whether it is read/write or read-only, and, if read-only, the 
staleness limit. For a read/write transaction, the TxCache 
library simply starts a transaction on the database server, 
and passes all queries directly to it. At the beginning of a 
read-only transaction, the library contacts the pincushion 
to request the list of pinned snapshots within the staleness 
limit, then chooses one to run the transaction at. If no 
sufficiently recent snapshots exist, the library starts a new 
transaction on the database and pins its snapshot. 

The library can delay beginning an underlying read- 
only transaction on the database (i.e. sending a BEGIN 
SQL statement) until it actually needs to issue a query. 
Thus, transactions whose requests are all satisfied from 
the cache do not need to connect to the database at all. 

When a cacheable function’s wrapper is called, the 
library checks whether its result is in the cache. To do so, 
it serializes the function’s name and arguments into a key 
(a hash of the function’s code could also be used to handle 
software updates). The library finds the responsible cache 
server using consistent hashing, and sends it a LOOKUP 
request. The request includes the transaction’s timestamp, 
which any returned value must satisfy. If the cache returns 
a matching result, the library returns it directly to the 
program. 
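
For illustration, key construction and server selection could be sketched as below. The ring layout (64 virtual points per server, SHA-1 hashing), the JSON key format, and all names here are assumptions; the text specifies only that the key is derived from the function's name and arguments and that consistent hashing picks the server.

```python
import bisect
import hashlib
import json
from typing import List


def cache_key(function_name: str, *args) -> str:
    """Serialize the function's name and arguments into a cache key."""
    return json.dumps([function_name, list(args)])


class HashRing:
    """A basic consistent-hash ring with virtual points per server."""

    def __init__(self, servers: List[str], replicas: int = 64):
        self._ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

    def server_for(self, key: str) -> str:
        # The first ring point after the key's hash owns the key (wrapping around).
        i = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[i][1]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
server = ring.server_for(cache_key("get_item", 42))
```

With this construction, removing a server only remaps the keys that server owned; keys on the remaining servers stay where they were.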

¹ In languages such as PHP that lack higher-order functions, the 
syntax is slightly more complicated, but the concept is the same. 

In the event of a cache miss, the library calls the 
cacheable function’s implementation. As the cacheable 
function issues queries to the database, the library 
accumulates the validity intervals and invalidation tags 
returned by these queries. The final result of the cacheable 
function is valid at all times in the intersection of the 
accumulated validity intervals. When the cacheable 
function returns, the library serializes its result and inserts 
it into the cache, tagged with the accumulated validity 
interval and any invalidation tags. 
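
The simplified wrapper can be sketched as follows, under the restrictions of this section (a timestamp fixed at transaction start and no nested calls). Python stands in for the PHP bindings here; ReadOnlyTxn, record_query, and the dictionary used as a cache are hypothetical stand-ins, and intervals are assumed to be half-open.

```python
import json
import math


class ReadOnlyTxn:
    """Hypothetical per-transaction state for the simplified, non-nested case."""

    def __init__(self, timestamp):
        self.timestamp = timestamp
        self.reset()

    def reset(self):
        self.validity = (-math.inf, math.inf)
        self.tags = set()

    def record_query(self, validity, tags):
        # Called once per database query issued by a cacheable function.
        lo, hi = self.validity
        self.validity = (max(lo, validity[0]), min(hi, validity[1]))
        self.tags |= set(tags)


def make_cacheable(fn, cache, txn):
    """Wrap fn so that it first checks for an available cached value."""
    def wrapper(*args):
        key = json.dumps([fn.__name__, list(args)])
        for (lo, hi), _tags, value in cache.get(key, []):
            if lo <= txn.timestamp < hi:   # hit: valid at the txn's timestamp
                return value
        txn.reset()
        value = fn(*args)                  # miss: run the implementation
        cache.setdefault(key, []).append((txn.validity, frozenset(txn.tags), value))
        return value
    return wrapper


cache, txn, calls = {}, ReadOnlyTxn(timestamp=15), []


def get_item(item_id):
    calls.append(item_id)
    txn.record_query((10, 20), {f"items:id={item_id}"})   # assumed query results
    txn.record_query((12, math.inf), {"bids:*"})
    return {"id": item_id}


cached_get_item = make_cacheable(get_item, cache, txn)
cached_get_item(1)
cached_get_item(1)   # served from the cache; get_item runs only once
```

The stored entry carries the intersection of the two query intervals, (12, 20), along with both invalidation tags.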


6.2 Choosing Timestamps Lazily 


Above, we assumed that the library chooses a read-only 
transaction’s timestamp when the transaction starts. Al- 
though straightforward, this approach requires the library 
to decide on a timestamp without any knowledge of what 
data is in the cache or what data will be accessed. Lack- 
ing this knowledge, it is not clear what policy would 
provide the best hit rate. 

However, the timestamp need not be chosen immedi- 
ately. Instead, it can be chosen lazily based on which 
cached results are available. This takes advantage of 
the fact that each cached value is valid over a range of 
timestamps: its validity interval. For example, consider 
a transaction that has observed a single cached result x. 
This transaction can still be serialized at any timestamp 
in x’s validity interval. On the transaction’s next call to 
a cacheable function, any cached value whose validity 
interval overlaps x’s can be chosen, as this still ensures 
there is at least one timestamp at which the transaction 
can be serialized. As the transaction proceeds, the set of 
possible serialization points narrows each time the trans- 
action reads a cached value or a database query result. 

Specifically, the algorithm proceeds as follows. When 
a transaction begins, the library requests from the pin- 
cushion all pinned snapshot IDs that satisfy its freshness 
requirement. It stores this set as its pin set. The pin 
set represents the set of timestamps at which the current 
transaction can be serialized; it will be updated as the 
cache and the database are accessed. The pin set also 
initially contains a special ID, denoted *, which indicates 
that the transaction can also be run in the present, on some 
newly pinned snapshot. The pin set contains * only until 
the first cacheable function in the transaction executes. 

When the application invokes a cacheable function, the 
library sends a LOOKUP request for the appropriate key, 
but instead of indicating a single timestamp, it indicates 
the bounds of the pin set (the lowest and highest 
timestamp, excluding *). The transaction can use any cached 
value whose validity interval overlaps these bounds and 
still remain serializable at one or more timestamps. The 
library then reduces the transaction’s pin set by eliminat- 
ing all timestamps that do not lie in the returned value’s 
validity interval, since observing a cached value means 
the transaction can no longer be serialized outside its 
validity interval. This includes removing * from the pin 
set because once the transaction has used cached data, it 
cannot be run on a new, possibly inconsistent snapshot. 
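
The narrowing step can be illustrated with a small sketch. The PinSet class, its method names, and the timestamps in the example are hypothetical; validity intervals are assumed to be half-open, and the special "present" ID is modelled as a boolean flag.

```python
import math


class PinSet:
    def __init__(self, pinned_timestamps):
        self.timestamps = set(pinned_timestamps)
        self.has_star = True   # the special ID for "run in the present"

    def bounds(self):
        """Lowest and highest pinned timestamp, excluding the special ID."""
        if not self.timestamps:
            return None
        return min(self.timestamps), max(self.timestamps)

    def observe(self, validity):
        """Narrow the pin set after reading a value valid over [lo, hi)."""
        lo, hi = validity
        self.timestamps = {t for t in self.timestamps if lo <= t < hi}
        self.has_star = False


def overlaps(validity, bounds):
    """Cache-side test: does [lo, hi) contain a timestamp in [low, high]?"""
    lo, hi = validity
    low, high = bounds
    return lo <= high and low < hi


pins = PinSet([41, 44, 46, 50])
first = (43, 47)                      # validity of the first cached value
assert overlaps(first, pins.bounds())
pins.observe(first)                   # leaves {44, 46}
second = (45, math.inf)               # a second value, still current
assert overlaps(second, pins.bounds())
pins.observe(second)                  # leaves {46}
```

Each observation can only shrink the set, which is why a later lookup is constrained by everything the transaction has already read.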

When the cache does not contain any entries that match 
both the key and the requested interval, a cache miss 
occurs. In this case, the library calls the cacheable func- 
tion’s implementation, as before. When the transaction 
makes its first database query, the library is finally forced 
to select a specific timestamp from the pin set and BE- 
GIN a read-only transaction on the database at the chosen 
timestamp. If a non-* timestamp is chosen, the 
transaction runs on that timestamp’s saved snapshot. If * is 
chosen, the library starts a new transaction, pinning the latest 
snapshot and reporting the pin to the pincushion. The pin 
set is then reified: * is replaced with the newly-created 
snapshot’s timestamp, replacing the abstract concept of 
“the present time” with a concrete timestamp. 

The library needs a policy to choose which pinned 
snapshot from the pin set it should run at. Simply 
choosing * if available, or the most recent timestamp otherwise, 
biases transactions towards running on recent data, but 
results in a very large number of pinned snapshots, which 
can ultimately slow the system down. To avoid the 
overhead of creating many snapshots, we used the following 
policy: if the most recent timestamp in the pin set is 
older than five seconds and * is available, then the library 
chooses * in order to produce a new pinned snapshot; 
otherwise it chooses the most recent timestamp. 
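
This policy is short enough to write out directly. In the sketch below, the function name and the representation of the pin set (a mapping from snapshot timestamp to wall-clock pin time, plus a flag for the special "present" ID) are assumptions.

```python
def choose_timestamp(pinned, has_star, now, max_age=5.0):
    """Pick the snapshot to run at.

    pinned:   dict mapping snapshot timestamp -> wall-clock pin time (seconds)
    has_star: whether the special "present" ID is still in the pin set
    Returns a snapshot timestamp, or "*" to request a newly pinned snapshot.
    """
    if not pinned:
        if not has_star:
            raise ValueError("pin set is empty")
        return "*"
    newest = max(pinned)
    if has_star and now - pinned[newest] > max_age:
        return "*"      # newest snapshot is too old: pin a fresh one
    return newest       # otherwise reuse the most recent pinned snapshot
```

For example, with snapshots pinned at wall-clock times 100.0 and 103.0, a call at time 104.0 reuses the newer snapshot, while a call at time 110.0 asks for a new one if the special ID is still available.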

During the execution of a cacheable function, the va- 
lidity intervals of the queries that the function makes are 
accumulated, and their intersection defines the validity 
interval of the cacheable result, just as before. In addi- 
tion, just like when a transaction observes values from 
the cache, each time it observes query results from the 
database, the transaction’s pin set is reduced by eliminat- 
ing all timestamps outside the result’s validity interval, as 
the transaction can no longer be serialized at these points. 
If the transaction’s pin set still contains *, * is removed. 

The validity interval of the cacheable function and pin 
set of the transaction are two distinct but related notions: 
the function’s validity interval is the set of timestamps 
at which its result is valid, and the pin set is the set of 
timestamps at which the enclosing transaction can be 
serialized. The pin set always lies within the validity 
interval, but the two may differ when a transaction calls 
multiple cacheable functions in sequence, or performs 
“bare” database queries outside a cacheable function. 


6.2.1 Correctness 


Lazy selection of timestamps is a complex algorithm, 
and its correctness is not self-evident. The following two 
properties show that it provides transactional consistency. 


Invariant 1. All data seen by the application during 
a read-only transaction is consistent with the database 
state at every timestamp in the pin set, i.e. the transaction 
can be serialized at any timestamp in the pin set. 


Invariant 1 holds because any timestamps inconsistent 
with data the application has seen are removed from the 
pin set. The application sees two types of data: cached 
values and database query results. Each is tagged with its 
validity interval. The library removes from the pin set all 
timestamps that lie outside either of these intervals. 


Invariant 2. The pin set is never empty, i.e. the transac- 
tion can always be serialized at some timestamp. 


The pin set is initially non-empty: it contains the 
timestamps of all sufficiently-fresh pinned snapshots, if any, 
and always *. So we must ensure that at least one 
timestamp remains every time the pin set shrinks, i.e. when a 
result is obtained from the cache or database. 

When a value is fetched from the cache, its validity 
interval is guaranteed to intersect the transaction’s pin set 
at at least one timestamp. The cache will only return an 
entry with a non-empty intersection between its validity 
interval and the bounds of the transaction’s pin set. This 
intersection contains the timestamp of at least one pinned 
snapshot: if the result’s validity interval lies partially 
within and partially outside the bounds of the client’s pin 
set, then either the earliest or latest timestamp in the pin 
set lies in the intersection. If the result’s validity interval 
lies entirely within the bounds of the transaction’s pin 
set, then the pin set contains at least the timestamp of 
the pinned snapshot from which the cached result was 
originally generated. Thus, Invariant 2 continues to hold 
even after removing from the pin set any timestamps that 
do not lie within the cached result’s validity interval. 

It is easier to see that when the database returns a 
query result, the validity interval intersects the pin set 
at at least one timestamp. The validity interval of the 
query result must contain the timestamp of the pinned 
snapshot at which it was executed, by definition. That 
pinned snapshot was chosen by the TxCache library from 
the transaction’s pin set (or it chose *, obtained a new 
snapshot, and added it to the pin set). Thus, at least that 
one timestamp will remain in the pin set after intersecting 
it with the query’s validity interval. 


6.3 Handling Nested Calls 


In the preceding sections, we assumed that cacheable 
functions never call other cacheable functions. However, 
it is useful to be able to nest calls to cacheable functions. 
For example, a user’s home page at an auction site might 
contain a list of items the user recently bid on. We might 
want to cache the description and price for each item as 
a function of the item ID (because they might appear on 
other users’ pages) in addition to the complete content of 
the user’s page (because the user might access it again). 


Our implementation supports nested calls; this does 
not require any fundamental changes to the approach 
above. However, we must keep track of a separate cumu- 
lative validity interval and invalidation tag set for each 
cacheable function in the call stack. When a cached value 
or database query result is accessed, its validity interval is 
intersected with that of each function currently on the call 
stack. As a result, a nested call to a cacheable function 
may have a wider validity interval than its enclosing func- 
tion, but not vice versa. This makes sense, as the outer 
function might have seen more data than the functions it 
calls (e.g. if it calls more than one cacheable function). 
Similarly, any invalidation tags from the database are 
attached to each function on the call stack, as each now 
has a dependency on the data. 
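
A minimal sketch of this per-frame bookkeeping, assuming half-open intervals; the CallStack class, its method names, and the intervals and tags in the example are hypothetical.

```python
import math


class CallStack:
    """One cumulative validity interval and tag set per active cacheable call."""

    def __init__(self):
        self.frames = []

    def enter(self):
        self.frames.append({"validity": (-math.inf, math.inf), "tags": set()})

    def exit(self):
        return self.frames.pop()

    def observe(self, validity, tags=()):
        # Every function currently on the stack has now seen this data.
        for frame in self.frames:
            lo, hi = frame["validity"]
            frame["validity"] = (max(lo, validity[0]), min(hi, validity[1]))
            frame["tags"] |= set(tags)


stack = CallStack()
stack.enter()                                  # outer: render the user's page
stack.enter()                                  # inner: look up item 1
stack.observe((10, 30), {"items:id=1"})
item1 = stack.exit()
stack.enter()                                  # inner: look up item 2
stack.observe((20, 40), {"items:id=2"})
item2 = stack.exit()
page = stack.exit()
print(item1["validity"], item2["validity"], page["validity"])
# (10, 30) (20, 40) (20, 30)
```

The outer frame ends up with the narrowest interval and the union of the tags, matching the observation that a nested call may have a wider validity interval than its enclosing function but not the reverse.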


7 Experiences 


We implemented all the components of TxCache, in- 
cluding the cache server, database modifications to Post- 
greSQL to support validity tracking and invalidations, 
and the cache library with PHP language bindings. 

One of TxCache’s goals is to make it easier to add 
caching to a new or existing application. The TxCache 
library makes it straightforward to designate a function 
as cacheable. However, ensuring that the program has 
functions suitable for caching still requires some effort. 
Below, we describe our experiences adding support for 
caching to the RUBiS benchmark and to MediaWiki. 


7.1 Porting RUBiS 


RUBiS [2] is a benchmark that implements an auction 
website modeled after eBay where users can register 
items for sale, browse listings, and place bids on items. 
We ported its PHP implementation to use TxCache. Like 
many small PHP applications, the PHP implementation 
of RUBiS consists of 26 separate PHP scripts, written 
in an unstructured way, which mainly make database 
queries and format their output. Besides changing code 
that begins and ends transactions to use TxCache’s inter- 
faces, porting RUBiS to TxCache involved identifying 
and designating cacheable functions. The existing im- 
plementation had few functions, so we had to begin by 
dividing it into functions; this was not difficult and would 
be unnecessary in a more modular implementation. 

We cached objects at two granularities. First, we 
cached large portions of the generated HTML output 
(except some headers and footers) for each page. This 
meant that if two clients viewed the same page with the 
same arguments, the previous result could be reused. Sec- 
ond, we cached common functions such as authenticating 
a user’s login, or looking up information about a user or 
item by ID. Even these fine-grained functions were often 
more complicated than an individual query; for example, 
looking up an item requires examining both the active 
items table and the old items table. These fine-grained 
cached values can be shared between different pages; for 
example, if two search results contain the same item, the 
description and price of that item can be reused. 

We made a few modifications to RUBiS that were not 
strictly necessary but improved its performance. To take 
better advantage of the cache, we modified the code for 
displaying lists of items to obtain details about each item 
by calling our GET-ITEM cacheable function rather than 
performing a join on the database. We also observed that 
one interaction, finding all the items for sale in a particu- 
lar region and category, required performing a sequential 
scan over all active auctions, and joining it against the 
users table. This severely impacted the performance of 
the benchmark with or without caching. We addressed 
this by adding a new table and index containing each 
item’s category and region IDs. Finally, we removed a 
few queries that were simply redundant. 


7.2 Porting MediaWiki 


We also ported MediaWiki to use TxCache, to better un- 
derstand the process of adding caching to a more complex, 
existing system. MediaWiki, which faces significant scal- 
ing challenges in its use for Wikipedia, already supports a 
variety of caches and replication systems. Unlike RUBiS, 
it has an object-oriented design, making it easier to select 
cacheable functions. 

MediaWiki supports master-slave replication for the 
database server. Because the slaves cannot process up- 
date transactions and lag slightly behind the master, Me- 
diaWiki already distinguishes the few transactions that 
must see the latest state from the majority that can accept 
the staleness caused by replication lag (typically 1-30 
seconds). It also identifies read/write transactions, which 
must run on the master. Although we used only one 
database server, we took advantage of this classification 
of transactions to determine which transactions can be 
cached and which must execute directly on the database. 

Most MediaWiki functions are class member functions. 
Caching only pure functions requires being sure that func- 
tions do not mutate their object. We cached only static 
functions that do not access or modify global variables 
(MediaWiki rarely uses global variables). Of the non- 
static functions, many can be made static by explicitly 
passing in any member variables that are used, as long 
as they are only read. For example, almost every func- 
tion in the TITLE class, which represents article titles, is 
cacheable because a TITLE object is immutable. 

Identifying functions that would be good candidates 
for caching was more challenging, as MediaWiki is a 
complex application with myriad features. Developers 
with previous experience with the MediaWiki codebase 
would have more insight into which functions were fre- 
quently used. We looked for functions that were involved 
in common requests like rendering an article, and mem- 
ber functions of commonly-used classes. We focused on 
functions that constructed objects based on data looked 
up in the database, such as fetching a page revision. These 
were good candidates for caching because we can avoid 
the cost of one or more database queries, as well as the 
cost of post-processing the data from the database to fill 
the fields of the object. We also adapted existing caches 
like the localization cache, which stores translations of 
user interface messages. 


8 Evaluation 


We used RUBiS as a benchmark to explore the perfor- 
mance benefits of caching. In addition to the PHP auction 
site implementation described above, RUBiS provides a 
client emulator that simulates many concurrent user ses- 
sions: there are 26 possible user interactions (e.g. brows- 
ing items by category, viewing an item, or placing a bid), 
each of which corresponds to a transaction. We used 
the standard RUBiS “bidding” workload, a mix of 85% 
read-only interactions (browsing) and 15% read/write in- 
teractions (placing bids) with a think time with negative 
exponential distribution and 7-second mean. 

We ran our experiments on a cluster of 10 servers, each 
a Dell PowerEdge SC1420 with two 3.20 GHz Intel Xeon 
CPUs, 2 GB RAM, and a Seagate ST31500341AS 7200 
RPM hard drive. The servers were connected via a gigabit 
Ethernet switch, with 0.1 ms round-trip latency. One 
server was dedicated to the database; it ran PostgreSQL 
8.2.11 with our modifications. The others acted as front- 
end web servers running Apache 2.2.12 with PHP 5.2.10, 
or as cache nodes. Four other machines, connected via 
the same switch, served as client emulators. Except as 
otherwise noted, database server load was the bottleneck. 

We used two different database configurations. One 
configuration was chosen so that the dataset would fit 
easily in the server’s buffer cache, representative of appli- 
cations that strive to fit their working set into the buffer 
cache for performance. This configuration had about 
35,000 active auctions, 50,000 completed auctions, and 
160,000 registered users, for a total database size of about 
850 MB. The larger configuration was disk-bound; it had 
225,000 active auctions, 1 million completed auctions, 
and 1.35 million users, for a total database size of 6 GB. 

For repeatability, each test ran on an identical copy 
of the database. We ensured the cache was warm by 
restoring its contents from a snapshot taken after one hour 
of continuous processing for the in-memory configuration 
and one day for the disk-bound configuration. 

For the in-memory configuration, we used seven hosts 
as web servers, and two as dedicated cache nodes. For the 
larger configuration, eight hosts ran both a web server and 
a cache server, in order to make a larger cache available. 

Figure 6: Effect of cache size on cache hit rate (30 second staleness limit) 


8.1 Cache Sizes and Performance 


We evaluated RUBiS’s performance in terms of the peak 
throughput achieved (requests handled per second) as 
we varied the number of emulated clients. Our baseline 
measurement evaluates RUBiS running directly on the 
Postgres database, with TxCache disabled. This achieved 
a peak throughput of 928 req/s with the in-memory config- 
uration and 136 req/s with the disk-bound configuration. 

We performed this experiment with both a stock copy 
of Postgres, and our modified version. We found no 
observable difference between the two cases, suggesting 
our modifications have negligible performance impact. 
Because the system already maintains multiple versions 
to implement snapshot isolation, keeping a few more 
versions around adds little cost, and tracking validity 
intervals and invalidation tags simply adds an additional 
bookkeeping step during query execution. 

We then ran the same experiment with TxCache en- 
abled, using a 30 second staleness limit and various cache 
sizes. The resulting peak throughput levels are shown 
in Figure 5. Depending on the cache size, the speedup 
achieved ranged from 2.2x to 5.2x for the in-memory 
configuration and from 1.8x to 3.2x for the disk-bound 
configuration. The RUBiS PHP benchmark does not per- 
form significant application-level computation; even so, 
we see a 15% reduction in total web server CPU usage. 
Cache server load is low, with most CPU overhead in 
kernel time, suggesting inefficiencies in the kernel’s TCP 
stack as the cause. Switching to a UDP protocol might 
alleviate some of this overhead [32]. 

Figure 6(a) shows that for the in-memory configura- 
tion, the cache hit rate ranged from 27% to 90%, increas- 
ing linearly until the working set size is reached, and 
then growing slowly. Here, the cache hit rate directly 
translates into a performance improvement because each 
cache hit represents load (often many queries) removed 
from the database. Interestingly, we always see a high 
hit rate on the disk-bound database (Figure 6(b)) but it 
does not always translate into a large performance im- 
provement. This workload exhibits some very frequent 
queries (e.g. looking up a user’s nickname by ID) that can 
be stored in even a small cache, but are also likely to be 
in the database’s buffer cache. It also has a large number 
of data items that are each accessed rarely (e.g. the full 
bid history for each item). The latter queries collectively 
make up the bottleneck, and the speedup is determined 
by how much of this data is in the cache. 


8.2 Varying Staleness Limits 


The staleness limit is an important parameter. By raising 
this value, applications may be exposed to increasingly 
stale data, but are able to take advantage of more cached 
data. An invalidated cache entry remains useful for the 
duration of the staleness limit, which is valuable for 
values that change (and are invalidated) frequently. 

Figure 7: Impact of staleness limit on peak throughput 

Figure 7 compares the peak throughput obtained by 
running transactions with staleness limits from 1 to 120 
seconds. Even a small staleness limit of 5-10 seconds 
provides a significant benefit. RUBiS has some objects 
that are expensive to compute and have many data depen- 
dencies (indexes of all items in particular regions with 
their current prices). These objects are invalidated fre- 
quently, but the staleness limit permits them to be used. 
The benefit diminishes at around 30 seconds, suggesting 
that the bulk of the data either changes infrequently (such 
as information about inactive users or auctions), or is 
accessed multiple times every 30 seconds (such as the 
aforementioned index pages). 


8.3 Costs of Consistency 


A natural question is how TxCache’s guarantee of trans- 
actional consistency affects its performance. We explore 
this question by examining cache statistics and compar- 
ing against other approaches. 

We classified cache misses into four types, inspired by 
the common classification for CPU cache misses: 


• compulsory miss: the object was never in the cache 

• staleness miss: the object has been invalidated, and 
its staleness limit has been exceeded 

• capacity miss: the object was previously evicted 

• consistency miss: some sufficiently fresh version of 
the object was available, but it was inconsistent with 
previous data read by the transaction 


Figure 8 shows the breakdown of misses by type for four 
different configurations. Our cache server unfortunately 
cannot distinguish staleness and capacity misses. We see 
that consistency misses are the least common by a large 
margin. Consistency misses are rare, as items in the cache 
are likely to have overlapping validity intervals, either 
because they change rarely or the cache contains multiple 
versions. Workloads with higher staleness limits experi- 
ence more consistency misses (but fewer overall misses) 
because they have more stale data that must be matched 
to other items valid at the same time. The 64 MB-sized 
cache’s workload is dominated by capacity misses, be- 
cause the cache is smaller than the working set. The 
disk-bound experiment sees more compulsory misses be- 
cause it has a larger dataset with limited locality, and few 
consistency misses because the update rate is slower. 

The low fraction of consistency misses suggests that 
providing consistency has little performance cost. We 
verified this experimentally by modifying our cache to 
continue to use our invalidation mechanism, but to read 
any data that was valid within the last 30 seconds, blithely 
ignoring consistency. The results of this experiment are 
shown as the “No consistency” line in Figure 5(a). As 
predicted, the benefit it provides over consistency is small. 
On the disk-bound configuration, the results could not be 
distinguished within experimental error. 


9 Related Work 


High performance web applications use many different 
techniques to improve their throughput. These range from 
lightweight application-level caches which typically do 
not provide transactional consistency, to database repli- 
cation systems that improve database performance while 
providing the same consistency guarantees, but do not 
address application server load. 


9.1 Application-Level Caching 


Applying caching at the application layer is an appeal- 
ing option because it can improve performance of both 
the application servers and the database. Dynamic web 
caches operate at the highest layer, storing entire web 
pages produced by the application, requiring them to be 
regenerated in their entirety when any content changes. 
These caches need to invalidate pages when the underly- 
ing data changes, typically by requiring the application to 
explicitly invalidate pages [37] or specify data dependen- 
cies [9, 38]. TxCache obviates this need by integrating 
with the database to automatically identify dependencies. 

However, full-page caching is becoming less appealing 
to application developers as more of the web becomes 
personalized and dynamic. Instead, web developers are 
increasingly turning to application-level data caches [4, 
16, 24, 26, 34] for their flexibility. These caches allow 
the application to choose what to store, including query 
results, arbitrary application data (such as Java or .NET 
objects), and fragments of or whole web pages.

These caches present to applications a 
GET/PUT/DELETE hash table interface, so the ap- 
plication developer must choose keys and correctly 
invalidate objects. As we argued in Section 2.1, this 
can be a source of unnecessary complexity and software 
bugs. Most application object caches have no notion of 
transactions, so they cannot ensure even that two accesses 
to the cache return consistent values. Some support 
transactions within the cache, allowing applications to 
atomically update objects in the cache [34, 16], but none 
maintain transactional consistency with the database. 


9.2 Database Replication 


Another popular alternative is to deploy a caching or repli- 
cation system within the database layer. These systems 
replicate the data tuples that comprise the database, and 
allow replicas to perform queries on them. Accordingly, 
they can relieve load on the database, but offer no benefit 
for application server load. 

Some replication systems guarantee transactional con- 
sistency by using group communication to execute 
queries [12, 19], which can be difficult to scale to large 
numbers of replicas [13]. Others offer weaker guarantees 
(eventual consistency) [11, 27], which can be difficult to 
reason about and use correctly. Still others require the 
developer to know the access pattern beforehand [3] or 
statically partition the data [8]. 

Most replication schemes used in practice take a pri- 
mary copy approach, where all modifications are pro- 
cessed at a master and shipped to slave replicas, usually 
asynchronously for performance reasons. Each replica 
then maintains a complete, if slightly stale, copy of the 
database. Several systems defer update processing to 
improve performance for applications that can tolerate 
limited amounts of staleness [6, 28, 30]. These protocols 
assume that each replica is a single, complete snapshot 
of the database, making them infeasible for use in an 
application object cache setting where it is not possible to 
maintain a copy of every object that could be computed. 
In contrast, TxCache’s protocol allows it to ensure con- 
sistency even though its cache contains cached objects 
that were generated at different times. 

Materialized views are a form of in-database caching
that creates a view table containing the result of a query
over one or more base tables, and updates it as the base
tables change. Most work on materialized views seeks to
incrementally update the view rather than recomputing
it in its entirety [15]. This requires placing restrictions
on view definitions, e.g. requiring them to be expressed
in the select-project-join algebra. TxCache’s application-
level functions, in addition to being computed outside
the database, can include arbitrary computation, making
incremental updates infeasible. Instead, it uses
invalidations, which are easier for the database to
compute [7].


10 Conclusion 


Application data caches are an efficient way to scale 
database-driven web applications, but they do not inte- 
grate well with databases or web applications. They break 
the consistency guarantees of the underlying database, 
making it impossible for the application to see a consis- 
tent view of the entire system. They provide a minimal 
interface that requires the application to provide signifi- 
cant logic for keeping cached values up to date, and often 
requires application developers to understand the entire 
system in order to correctly manage the cache. 

We provide an alternative with TxCache, an 
application-level cache that ensures all data seen by an 
application during a transaction is consistent, regardless 
of whether it comes from the cache or database. TxCache 
guarantees consistency by modifying the database server 
to return validity intervals, tagging data in the cache with 
these intervals, and then only retrieving values from the 
cache that were valid at a single point in time. By using 
validity intervals instead of single timestamps, TxCache 
can make the best use of cached data by lazily selecting 
the timestamp for each transaction. 
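The idea of lazily selecting a timestamp can be illustrated by intersecting validity intervals as a transaction reads. The sketch below is a simplified illustration, not TxCache's protocol: the class name `LazyTimestamp` and method `try_read` are hypothetical, and intervals are assumed to be closed.

```python
class LazyTimestamp:
    """Tracks the set of timestamps at which a transaction may still run."""

    def __init__(self, lo: float, hi: float):
        # Initially any timestamp within the staleness window is acceptable.
        self.lo, self.hi = lo, hi

    def try_read(self, valid_from: float, valid_until: float) -> bool:
        """Accept a cached value only if its validity interval overlaps the
        remaining candidate range; on success, narrow the range."""
        new_lo = max(self.lo, valid_from)
        new_hi = min(self.hi, valid_until)
        if new_lo > new_hi:
            return False          # no single point in time covers all reads
        self.lo, self.hi = new_lo, new_hi
        return True
```

Because the candidate range is only narrowed as values are read, a value is rejected only when no single point in time is consistent with everything the transaction has already seen.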

TxCache provides an easier programming model for 
application developers by allowing them to simply des- 
ignate cacheable functions, and then have the results of 
those functions automatically cached. The TxCache li- 
brary handles all of the complexity of managing the cache 
and maintaining consistency across the system: it selects 
keys, finds data in the cache consistent with the current 
transaction, and automatically detects and invalidates po- 
tentially changed objects as the database is updated. 

Our experiments with the RUBiS benchmark show that 
TxCache is effective at improving scalability even when 
the application tolerates only a small interval of staleness, 
and that providing transactional consistency imposes only 
a minor performance penalty. 
Abstract


Piccolo is a new data-centric programming model for 
writing parallel in-memory applications in data centers. 
Unlike existing data-flow models, Piccolo allows compu- 
tation running on different machines to share distributed, 
mutable state via a key-value table interface. Piccolo en- 
ables efficient application implementations. In particu- 
lar, applications can specify locality policies to exploit 
the locality of shared state access and Piccolo’s run-time 
automatically resolves write-write conflicts using user- 
defined accumulation functions. 

Using Piccolo, we have implemented applications for 
several problem domains, including the PageRank algo- 
rithm, k-means clustering and a distributed crawler. Ex- 
periments using 100 Amazon EC2 instances and a 12 
machine cluster show Piccolo to be faster than existing 
data flow models for many problems, while providing 
similar fault-tolerance guarantees and a convenient pro- 
gramming interface. 


1 Introduction 


With the increased availability of data centers and cloud 
platforms, programmers from different problem domains 
face the task of writing parallel applications that run 
across many nodes. These applications range from machine
learning problems (k-means clustering, neural network
training) to graph algorithms (PageRank) and scientific
computation. Many of these applications extensively
access and mutate shared intermediate state stored in 
memory. 

It is difficult to parallelize in-memory computation 
across many machines. As the entire computation is di- 
vided among multiple threads running on different ma- 
chines, one needs to coordinate these threads and share 
intermediate results among them. For example, to com- 
pute the PageRank score of web page p, a thread needs 
to access the PageRank scores of p’s “neighboring” web 
pages, which may reside in the memory of threads running
on different machines. Traditionally, parallel in-memory
applications have been built using message-passing
primitives such as MPI [21]. For many users, the
communication-centric model provided by message passing
is too low-level an abstraction: they fundamentally care
about data and processing data, as opposed to the
location of data and how to get to it.

Data-centric programming models [19, 27, 1], in 
which users are presented with a simplified interface 
to access data but no explicit communication mecha- 
nism, have proven a convenient and popular mecha- 
nism for expressing many computations. MapReduce 
and Dryad [27] provide a data-flow programming model 
that does not expose any globally shared state. While the 
data-flow model is ideally suited for bulk-processing of 
on-disk data, it is not a natural fit for in-memory compu- 
tation: applications have no online access to intermediate 
state and often have to emulate shared memory access by 
joining multiple data streams. Distributed shared mem- 
ory [29, 32, 7, 17] and tuple spaces [13] allow sharing of 
distributed in-memory state. However, their simple mem- 
ory (or tuple) model makes it difficult for programmers 
to optimize for good application performance in a dis- 
tributed environment. 

This paper presents Piccolo, a data-centric program- 
ming model for writing parallel in-memory applications 
across many machines. In Piccolo, programmers orga- 
nize the computation around a series of application ker- 
nel functions, where each kernel is launched as multi- 
ple instances concurrently executing on many compute 
nodes. Kernel instances share distributed, mutable state 
using a set of in-memory tables whose entries reside in 
the memory of different compute nodes. Kernel instances 
share state exclusively via the key-value table interface 
with get and put primitives. The underlying Piccolo run- 
time sends messages to read and modify table entries 
stored in the memory of remote nodes. 

By exposing shared global state, the programming 
model of Piccolo offers several attractive features. First, 
it allows for natural and efficient implementations for
applications that require sharing of intermediate state,
such as k-means computation, n-body simulation, and
PageRank calculation. Second, Piccolo enables online
applications that require immediate access to modified shared
state. For example, a distributed crawler can learn of 
newly discovered pages quickly as a result of state up- 
dates done by ongoing web crawls. 


Piccolo borrows ideas from existing data-centric sys- 
tems to enable efficient application implementations. 
Piccolo enforces atomic operations on individual key- 
value pairs and uses user-defined accumulation func- 
tions to automatically combine concurrent updates on 
the same key (similar to reduce functions in MapRe- 
duce [19]). The combination of these two techniques 
eliminates the need for fine-grained application-level 
synchronization for most applications. Piccolo allows 
applications to exploit locality of access to shared state. 
Users control how table entries are partitioned across ma- 
chines by defining a partitioning function [19]. Based 
on users’ locality policies, the underlying run-time can 
schedule a kernel instance where its needed table parti- 
tions are stored, thereby reducing expensive remote table 
access. 


We have built a run-time system consisting of one 
master (for coordination) and several worker processes 
(for storing in-memory table partitions and executing 
kernels). The run-time uses a simple work stealing 
heuristic to dynamically balance the load of kernel exe- 
cution among workers. Piccolo provides a global check- 
point/restore mechanism to recover from machine fail- 
ures. The run-time uses the Chandy-Lamport snapshot 
algorithm [15] to periodically generate consistent
snapshots of the execution state without pausing active
computations. Upon machine failure, Piccolo recovers by
restarting the computation from its latest snapshot state.


Experiments have shown that Piccolo is fast and provides
excellent scaling for many applications. PageRank and
k-means run 11x and 4x faster, respectively, on Piccolo
than on Hadoop. Computing a PageRank iteration for
a 1 billion-page web graph takes only
70 seconds on 100 EC2 instances. Our distributed web 
crawler can easily saturate a 100 Mbps internet uplink 
when running on 12 machines. 


The rest of the paper is organized as follows. Sec- 
tion 2 provides a description of the Piccolo program- 
ming model, followed by the design of Piccolo’s run- 
time (Section 3). We describe the set of applications we 
constructed using Piccolo in Section 4. Section 5 dis- 
cusses our prototype implementation. We show Piccolo’s 
performance evaluation in Section 6 and present related 
work in Section 7. 
2 Programming Model


Piccolo’s programming environment is exposed as a li- 
brary to existing languages (our current implementation 
supports C++ and Python) and requires no change to un- 
derlying OS or compiler. This section describes the pro- 
gramming model in terms of how to structure application 
programs (§2.1), share intermediate state via key/value 
tables (§2.2), optimize for locality of access (§2.3), and 
recover from failures (§2.4). We conclude this section by
showing how to implement the PageRank algorithm on 
top of Piccolo (§2.5). 


2.1 Program structure 


Application programs written for Piccolo consist of con- 
trol functions which are executed on a single machine, 
and kernel functions which are executed concurrently 
on many machines. Control functions create shared ta- 
bles, launch multiple instances of a kernel function, and 
perform global synchronization. Kernel functions consist 
of sequential code that reads from and writes to tables
to share state among concurrently executing kernel in- 
stances. By default, control functions execute in a sin- 
gle thread and a single thread is created for executing 
each kernel instance. However, the programmer is free to 
create additional application threads in control or kernel 
functions as needed. 

Kernel invocation: The programmer uses the Run
function to launch a specified number (m) of kernel
instances executing the desired kernel function on
different machines. Each kernel instance has an identifier
in 0 ... m-1, which can be retrieved using the my_instance
function.

Kernel synchronization: The programmer invokes a 
global barrier from within a control function to wait for 
the completion of all previously launched kernels. Cur- 
rently, Piccolo does not support pair-wise synchroniza- 
tion among concurrent kernel instances. We found that 
global barriers are sufficient because Piccolo’s shared ta- 
ble interface makes most fine-grained locking operations 
unnecessary. This overall application structure, where 
control functions launch kernels across one or more 
global barriers, is reminiscent of the CUDA model [36] 
which also explicitly eschews support for pair-wise 
thread synchronization. 


2.2 Table interface and semantics 


Concurrent kernel instances share intermediate state
across machines through key-value based in-memory
tables. Table entries are spread across all nodes and each
key-value pair resides in the memory of a single node.
Each table is associated with explicit key and value types
which can be arbitrary user-declared serializable types.
As Figure 1 shows, the key-value interface provides a
uniform access model whether the underlying table entry
is stored locally or on another machine. The table
APIs include standard operations such as get and put, as
well as Piccolo-specific functions like update, flush, and
get_iterator. Only control functions can create tables;
both control and kernel functions can invoke any table
operation.


Table<Key, Value>:
    clear()
    contains(Key)
    get(Key)
    put(Key, Value)

    # updates the existing entry via
    # user-defined accumulation.
    update(Key, Value)

    # Commit any buffered updates/puts
    flush()

    # Return an iterator on a table partition
    get_iterator(Partition)


Figure 1: Shared Table Interface

User-defined accumulation: Multiple kernel in- 
stances can issue concurrent updates to the same key. 
To resolve such write-write conflicts, the programmer can
associate a user-defined accumulation function with each 
table. Piccolo executes the accumulator during run-time 
to combine concurrent updates on the same key. If the 
programmer expects results to be independent from the 
ordering of updates, the accumulator must be a commu- 
tative and associative function [52]. 

Piccolo provides a set of standard accumulators such 
as summation, multiplication and min/max. To de- 
fine an accumulator, the user specifies four functions: 
Initialize to initialize an accumulator for a newly cre- 
ated key, Accumulate to incorporate the effect of a sin- 
gle update operation, Merge to combine the contents of 
multiple accumulators on the same key, and View to re- 
turn the current accumulator state reflecting all updates 
accumulated so far. Accumulator functions have no ac- 
cess to global state except for the corresponding table 
entry being updated. 
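To make the four accumulator functions concrete, the following sketch shows a sum accumulator. The method signatures are assumptions made for illustration (the text names the four functions but does not give their signatures), and `apply_updates` is a hypothetical helper.

```python
class SumAccumulator:
    """Sketch of a summation accumulator with the four named functions."""

    def initialize(self) -> float:
        # State for a newly created key.
        return 0.0

    def accumulate(self, state: float, update: float) -> float:
        # Incorporate the effect of a single update operation.
        return state + update

    def merge(self, a: float, b: float) -> float:
        # Combine two accumulators that refer to the same key.
        return a + b

    def view(self, state: float) -> float:
        # Current value reflecting all updates accumulated so far.
        return state


def apply_updates(acc, updates):
    """Fold a sequence of updates into a fresh accumulator state."""
    state = acc.initialize()
    for u in updates:
        state = acc.accumulate(state, u)
    return state
```

Because addition is commutative and associative, merging partial states built on different workers yields the same value as applying every update in any order at one place.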

User-controlled Table Partitioning: Piccolo uses a 
user-specified partition function [19] to divide the key- 
space into partitions. Table partitioning is a key primitive 
for expressing user programs’ locality preferences. The 
programmer specifies the number of partitions (p) when 
creating a table. The p partitions of a table are named 
with integers 0 ... p-1. Kernel functions can scan all
entries in a given table partition using the get_iterator
function (see Figure 1). 
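A partition function simply maps a key to one of the p partitions. The sketch below shows two hypothetical examples: a generic hash partitioner and a site-based partitioner for (site, page) keys. The function names and signatures are assumptions, not Piccolo's API. A CRC is used rather than Python's built-in `hash` because string hashes are randomized per process, whereas every node must compute the same partition for a given key.

```python
import zlib


def hash_partitioner(key: str, num_partitions: int) -> int:
    """Map a string key to a partition in 0 ... num_partitions-1."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def site_partitioner(page_id, num_partitions: int) -> int:
    """Map a (site, page) key by its site only, so that all pages of one
    site land in the same partition."""
    site, _page = page_id
    return zlib.crc32(site.encode("utf-8")) % num_partitions
```

With the site-based function, `("example.org", "/a")` and `("example.org", "/b")` always fall in the same partition, regardless of the number of partitions chosen.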


Piccolo does not reveal to the programmer which node
stores a table partition, but guarantees that all table
entries in a given partition are stored on the same machine.
Although the run-time aims to have a load-balanced
assignment of table partitions to machines, it is the
programmer’s responsibility to ensure that the largest table
partition fits in the available memory of a single machine.
This can usually be achieved by specifying the number
of partitions to be much larger than the number of
machines.

Table Semantics: All table operations involving a sin- 
gle key-value pair are atomic from the application’s per- 
spective. Write operations (e.g. update, put) destined 
for another machine can be buffered to avoid blocking 
kernel execution. In the face of buffered remote writes, 
Piccolo provides the following guarantees: 


• All operations issued by a single kernel instance on
the same key are applied in their issuing order.
Operations issued by different kernel instances on the
same key are applied in some total order [31].


• Upon a successful flush, all buffered writes done
by the caller’s kernel instance will have been
committed to their respective remote locations, and will
be reflected in the response to subsequent gets by
any kernel instance.


• Upon the completion of a global barrier, all kernel
instances will have been completed and all their
writes will have been applied.
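The interplay between buffered writes and flush can be illustrated with a single-process sketch. The `Owner` and `BufferedWriter` classes are hypothetical, and the sketch ignores networking and concurrency entirely; it only shows that buffered updates are invisible until flushed, and that a read of a key first pushes out the caller's pending update on that key so that its own operations are applied in issuing order.

```python
class Owner:
    """Holds the authoritative entries of one table partition."""

    def __init__(self, accumulate):
        self.data = {}
        self.accumulate = accumulate   # user-defined accumulation function

    def apply(self, key, value):
        if key in self.data:
            self.data[key] = self.accumulate(self.data[key], value)
        else:
            self.data[key] = value


class BufferedWriter:
    """One kernel instance's view of a remote partition."""

    def __init__(self, owner: Owner):
        self.owner = owner
        self.buffer = {}               # pending updates, combined per key

    def update(self, key, value):
        # Combine with any pending update on the same key instead of sending.
        if key in self.buffer:
            self.buffer[key] = self.owner.accumulate(self.buffer[key], value)
        else:
            self.buffer[key] = value

    def flush(self):
        # Commit every buffered update to the owner.
        for key, value in self.buffer.items():
            self.owner.apply(key, value)
        self.buffer.clear()

    def get(self, key):
        # Flush this key's pending update before reading it.
        if key in self.buffer:
            self.owner.apply(key, self.buffer.pop(key))
        return self.owner.data.get(key)
```

In this sketch, two writers that each buffer updates to the same key leave the owner untouched until they flush, after which the owner holds the accumulated result of both.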


2.3 Expressing locality preferences 


While writes to remote table entries can be buffered at 
the local node, the communication latency involved in 
fetching remote table entries cannot be effectively hid- 
den. Therefore, the key to achieving good application 
performance is to minimize remote gets by exploiting 
locality of access. By organizing the computation as ker- 
nels and shared state as partitioned tables, Piccolo pro- 
vides a simple way for programmers to express local- 
ity policies. Such policies enable the underlying Piccolo 
run-time to execute a kernel instance on a machine that 
stores most of its needed data, thus minimizing remote 
reads. 

Piccolo supports two kinds of locality policies: (1) co- 
locate a kernel execution with some table partition, and 
(2) co-locate partitions of different tables. When launch- 
ing some kernel, the programmer can specify a table ar- 
gument in the Run function to express their preference 
for co-locating the kernel execution with that table. The 
programmer usually launches the same number of ker- 
nel instances as the number of partitions in the spec- 
ified table. The run-time schedules the i-th kernel in- 
stance to execute on the machine that stores the i-th par- 
tition of the specified table. To optimize for kernels that 
read from more than one table, the programmer uses the
GroupTables(T1, T2, ...) function to co-locate multiple
tables. The run-time assigns the i-th partition of T1, T2, ...
to be stored on the same machine. As a result, by
co-locating kernel execution with one of the tables, the
programmer can avoid remote reads for kernels that read
from the same partition of multiple tables.
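The effect of the two locality policies can be sketched as a placement computation. The functions below are hypothetical illustrations rather than Piccolo's scheduler; they assume a simple round-robin assignment of partitions to workers.

```python
def assign_partitions(num_partitions: int, workers):
    """Round-robin assignment of partition index to worker (assumed policy)."""
    return {i: workers[i % len(workers)] for i in range(num_partitions)}


def group_tables(table_names, num_partitions: int, workers):
    """Co-locate tables: partition i of every grouped table shares a worker."""
    shared = assign_partitions(num_partitions, workers)
    return {name: dict(shared) for name in table_names}


def place_kernels(placement, table: str, instances: int):
    """Run kernel instance i on the worker storing partition i of `table`."""
    return [placement[table][i] for i in range(instances)]
```

With tables `curr`, `next`, and `graph` grouped over two workers and four partitions, kernel instance i runs where partition i of all three tables is stored, so reads of that partition in any of the tables are local.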


2.4 User-assisted checkpoint and restore 


Piccolo handles machine failures via a global
checkpoint/restore mechanism. The mechanism currently
implemented is not fully automatic: Piccolo saves a
consistent global snapshot of all shared table state, but relies
on users to save additional information to recover the po- 
sition of their kernel and control function execution. We 
believe this design makes a reasonable trade-off. In prac- 
tice, the programming efforts required for checkpoint- 
ing user information are relatively small. On the other 
hand, our design avoids the overhead and complexities 
involved in automatically checkpointing C/C++ executa- 
bles. 

Based on our experience of writing applications, we 
arrived at two checkpointing APIs: one synchronous 
(CpBarrier) and one asynchronous (CpPeriodic). Both 
functions are invoked from some control function. Syn- 
chronous checkpoints are well-suited for iterative appli- 
cations (e.g. PageRank) which launch kernels in multiple 
rounds separated by global barriers and desire to save 
intermediate state every few rounds. On the other hand, 
applications with long running kernels (e.g. a distributed 
crawler) need to use asynchronous checkpoints to save 
their state periodically. 

CpBarrier takes as arguments a list of tables and a 
dictionary of user data to be saved as part of the check- 
point. Typical user data contain the value of some iterator 
in the control thread. For example, in PageRank, the
programmer would like to record the number of PageRank
iterations computed so far as part of the global check- 
point. CpBarrier performs a global barrier and ensures 
that the checkpointed state is equivalent to the state of 
execution at the barrier. 

CpPeriodic takes as arguments a list of tables, a time 
interval for periodic checkpointing, and a kernel call- 
back function CheckpointCallback. This callback is 
invoked for all active kernels on a node immediately after 
that node has checkpointed the state for its assigned ta- 
ble partitions. The callback function provides a way for 
the programmer to save the necessary data required to 
restore running kernel instances. Oftentimes this is the 
position of an iterator over the partition that is being 
processed by a kernel instance. When restoring, Piccolo 
reloads the table state on all nodes, and invokes kernel in- 
stances with the dictionary saved during the checkpoint. 
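The user-assisted part of a synchronous checkpoint amounts to saving a small dictionary of user data next to the table state. The single-process sketch below illustrates this idea only: `cp_barrier` and `restore` are hypothetical stand-ins, plain dictionaries stand in for tables, and no actual barrier or distributed snapshot is performed.

```python
import os
import pickle
import tempfile


def cp_barrier(path: str, tables: dict, user_data: dict) -> None:
    """Save table contents plus user data (e.g. the iteration counter)."""
    snapshot = {"tables": tables, "user_data": user_data}
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(snapshot, f)
    # Rename atomically so a crash never leaves a half-written checkpoint.
    os.replace(tmp, path)


def restore(path: str):
    """Reload table contents and the user data saved with them."""
    with open(path, "rb") as f:
        snapshot = pickle.load(f)
    return snapshot["tables"], snapshot["user_data"]


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "checkpoint.bin")
        cp_barrier(path, {"curr": {"a": 1.0}}, {"iteration": 5})
        return restore(path)
```

After a restore, the control function can resume its loop from the saved iteration number instead of starting from zero.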


2.5 Putting it together: PageRank 


As a concrete example, we show how to implement
PageRank using Piccolo. The PageRank algorithm [11]
takes as input a sparse web graph and computes a rank
value for each page. The computation proceeds in multiple
iterations: page i’s rank value in the k-th iteration
($p_i^{(k)}$) is the sum of the normalized ranks of its
incoming neighbors in the previous iteration, i.e.

$$p_i^{(k)} = \sum_{j \in In_i} \frac{p_j^{(k-1)}}{|Out_j|}$$

where $In_i$ denotes page i’s incoming neighbors and
$Out_j$ denotes page j’s outgoing neighbors.


tuple PageID(site, page)
const PropagationFactor = 0.85


def PRKernel(Table(PageID, double) curr,
             Table(PageID, double) next,
             Table(PageID, [PageID]) graph):
    for page, outlinks in graph.get_iterator(my_instance()):
        rank = curr[page]
        update = PropagationFactor * rank / len(outlinks)
        for target in outlinks:
            next.update(target, update)


def PageRank(Config conf):
    graph = Table(PageID, [PageID]).init("/dfs/graph")
    curr = Table(PageID, double).init(
        graph.numPartitions(),
        SumAccumulator, SitePartitioner)
    next = Table(PageID, double).init(
        graph.numPartitions(),
        SumAccumulator, SitePartitioner)

    GroupTables(curr, next, graph)

    if conf.restore():
        last_iter = curr.restore_from_checkpoint()
    else:
        last_iter = 0

    # run 50 iterations
    for i in range(last_iter, 50):
        Run(PRKernel,
            instances=curr.numPartitions(),
            locality=LOC_REQUIRED(curr),
            args=(curr, next, graph))

        # checkpoint every 5 iterations, storing the
        # current iteration alongside checkpoint data
        if i % 5 == 0:
            CpBarrier(tables=curr, {iteration=i})
        else:
            Barrier()

        # the values accumulated into 'next' become the
        # source values for the next iteration
        swap(curr, next)


Figure 2: PageRank Implementation

The complete PageRank implementation in Piccolo is 
shown in Figure 2. The input web graph is represented 
as a set of outgoing links, page → target, for each page.
The graph is loaded into the shared in-memory table 
(graph) from a distributed file system. For link graphs 
too large to fit in memory, Piccolo also supports a read- 
only DiskTable interface for streaming data from disk. 

The intermediate rank values are kept in two tables:
curr for the ranks to be read in the current iteration,
next for the ranks to be written. The control function
(PageRank) iteratively launches p PRKernel kernel
instances, where p is the number of table partitions in
graph (which is identical to that of curr and next). The
kernel instance i scans all pages in the i-th partition of
graph. For each page → target link, the kernel instance
reads the rank value of page in curr, and generates
updates for next to increment target’s rank value for the
next iteration.


Figure 3: The interactions between master and workers in
executing a Piccolo program.

Since the program generates concurrent updates to
the same key in next, it associates the Sum accumulator
with next, which correctly combines updates as desired
by the PageRank computation. The overall computation
proceeds in rounds using a global barrier between
PRKernel invocations.
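One round of this computation can be traced on a tiny graph with ordinary dictionaries. The sketch below is a sequential illustration, not Piccolo code: `pagerank_round` is a hypothetical function, dictionaries stand in for the curr, next, and graph tables, and summing into `nxt` plays the role of the Sum accumulator. Pages without outgoing links are skipped here, which is an assumption of this sketch.

```python
PROPAGATION_FACTOR = 0.85


def pagerank_round(graph: dict, curr: dict) -> dict:
    """Compute the next rank table from the current one (single process)."""
    nxt = {page: 0.0 for page in graph}
    for page, outlinks in graph.items():
        if not outlinks:
            continue                     # assumption: no links, no updates
        share = PROPAGATION_FACTOR * curr[page] / len(outlinks)
        for target in outlinks:
            # Equivalent to next.update(target, share) with a sum accumulator.
            nxt[target] = nxt.get(target, 0.0) + share
    return nxt


# Three pages: a links to b and c, b links to c, c links to a.
example_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
example_curr = {page: 1.0 for page in example_graph}
```

Starting from a rank of 1.0 for every page, one round gives page a the full share of c (0.85), page b half the share of a (0.425), and page c the sum of half of a's share and all of b's share (about 1.275).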

To optimize for locality, the program groups tables 
graph, curr, next together and expresses preference for 
co-locating PRKernel executions with the curr table. As
a result, none of the kernel instances need to perform any 
remote reads. In addition, the program uses the partition 
function, SitePartitioner, to assign the URLs in the 
same domain to the same partition. As pages in the same 
domain tend to link to one another frequently, such par- 
titioning significantly reduces the number of remote up- 
dates. 

Checkpointing/restoration is straightforward: the con- 
trol thread performs a synchronous checkpoint to save 
the next table every five iterations and loads the latest 
checkpointed table to recover from failure. 


3 System Design 


This section describes the run-time design for executing 
Piccolo programs on a large collection of machines con- 
nected via high-speed Ethernet. 


3.1 Overview 


Piccolo’s execution environment consists of one master
process and many worker processes, each executing
on a potentially different machine. Figure 3 illustrates
the overall interactions among workers and the master 
when executing a Piccolo program. As Figure 3 shows, 
the master executes the user control thread by itself and 
schedules kernel instances to execute on workers. Addi- 
tionally, the master decides how table partitions are as- 
signed to workers. Each worker is responsible for storing 
assigned table partitions in its memory and handling ta- 
ble operations associated with those partitions. Having a 
single master does not introduce a performance bottle- 
neck: the master informs all workers of the current par- 
tition assignment so that workers need not consult the 
master to perform performance-critical table operations. 

The master begins the execution of a Piccolo pro- 
gram by invoking the entry function in the control thread. 
Upon each table creation API call, the master decides on 
a partition assignment. The master informs all workers 
of the partition assignment and each worker initializes 
its set of partitions, which are all empty at startup. Upon 
each Run API call to execute m kernel instances, the mas- 
ter prepares m tasks, one for each kernel instance. The 
master schedules these tasks for execution on workers 
based on user’s locality preferences. Each worker runs a 
single kernel instance at a time and notifies the master 
upon task completion. The master instructs each com- 
pleted worker to proceed with an additional task if it is 
available. Upon encountering a global barrier, the mas- 
ter blocks the control thread until all active tasks are fin- 
ished. 

During kernel execution, a worker buffers update op- 
erations destined for remote workers, combines them us- 
ing user-defined accumulators and flushes them to re- 
mote workers after a short timeout. To handle a get or 
put operation, the worker flushes accumulated updates 
on the same key before sending the operation to the 
remote worker. Each owner applies operations (includ- 
ing accumulated updates) in their received order. Piccolo 
does not perform caching but supports a limited form 
of pre-fetching: after each get_iterator API call, the 
worker pre-fetches a portion of table entries beyond the 
current iterator value. 

Two main challenges arise in the above basic design. 
First, how to assign tasks in a load-balanced fashion so as 
to reduce the overall wait time on global barriers? This is 
particularly important for iterative applications that incur 
a global barrier at each iteration of the computation. The 
second challenge is to perform efficient checkpointing 
and restoration of table state. In the rest of this Section, 
we detail how Piccolo addresses both challenges. 


3.2 Load-balanced Task Scheduling 


Basic scheduling without load-balancing works as follows.
At table creation time, the master assigns table
partitions to all workers using a simple round-robin
assignment for empty memory tables. For tables loaded from
a distributed file, the master chooses an assignment that 
minimizes inter-rack transfer while keeping the number 
of partitions roughly balanced among workers. The mas- 
ter schedules m tasks according to the specified local- 
ity preference, namely, it assigns task i to execute on a 
worker storing partition i. 

This initial schedule may not be ideal. Due to hetero- 
geneous hardware configurations or variable-sized com- 
putation inputs, workers can take varying amounts of 
time to finish assigned tasks, resulting in load imbalance 
and non-optimal use of machines. Therefore, the run- 
time needs to load-balance kernel executions beyond the 
initial schedule. 

Piccolo’s scheduling freedom is limited by two con- 
straints: First, no running tasks should be killed. As a 
running kernel instance modifies shared table state, re- 
executing a terminated kernel instance requires perform- 
ing an expensive restore operation from a saved check- 
point. Therefore, once a kernel instance is started, it is 
better to let the task complete than to terminate it halfway
for re-scheduling. By contrast, MapReduce systems do 
not have this constraint [28] as reducers do not start ag- 
gregation until all mappers are finished. The second con- 
straint comes from the need to honor user locality pref- 
erences. Specifically, if a kernel instance is to be moved 
from one worker to another, its co-located table partitions 
must also be transferred across those workers. 

Load-balance via work stealing: Piccolo performs
a simple form of load-balancing: the master observes
the progress of different workers and instructs a worker
($w_{idle}$) that has finished all its assigned tasks to steal a
not-yet-started task i from the worker ($w_{busy}$) with the
most remaining tasks. We adopt the greedy heuristic of
scheduling larger tasks first. To implement this heuristic,
the master estimates the input size of each task by the
number of keys in its corresponding table partition. The
master collects partition size information from all workers
at table loading time as well as at each global barrier.
The master instructs each worker to execute its assigned
tasks in decreasing order of estimated task sizes. Additionally,
the idle worker $w_{idle}$ always steals the biggest
task among $w_{busy}$’s remaining tasks.
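
The stealing policy above can be illustrated with a small event-driven simulation. It is only a sketch under simplifying assumptions: the function name `run_with_stealing` and the `speed` parameter are hypothetical, task duration is taken to be estimated size divided by worker speed, and the cost of migrating the stolen task's partition is ignored.

```python
def run_with_stealing(assigned, sizes, speed):
    """Simulate one round of kernel execution with largest-first stealing.

    assigned: worker -> list of task ids initially placed on that worker
    sizes:    task id -> estimated size (e.g. keys in the task's partition)
    speed:    worker -> work units processed per time unit (assumed knob)
    Returns (makespan, executed_by), where executed_by maps task -> worker.
    """
    # Each worker runs its own tasks in decreasing order of estimated size.
    queues = {w: sorted(ts, key=lambda t: sizes[t], reverse=True)
              for w, ts in assigned.items()}
    clock = {w: 0.0 for w in assigned}  # time at which each worker is next free
    executed_by = {}
    while any(queues.values()):
        w = min(clock, key=clock.get)   # the next worker to become free
        if queues[w]:
            task = queues[w].pop(0)
        else:
            # Idle worker: steal the biggest not-yet-started task from the
            # worker with the most remaining tasks.
            busy = max(queues, key=lambda v: len(queues[v]))
            task = queues[busy].pop(0)
        clock[w] += sizes[task] / speed[w]
        executed_by[task] = w
    return max(clock.values()), executed_by
```

Since the simulation always advances the worker that becomes free earliest, every task still sitting in a queue at that moment has not started, which matches the constraint that running tasks are never killed or moved.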

Table partition migration: Because of user locality
preference, worker $w_{idle}$ needs to transfer the corresponding
table partition i from $w_{busy}$ before it executes
stolen task i. Since table migration occurs while other
active tasks are sending operations to partition i, Piccolo
must take care not to lose, re-order or duplicate operations
from any worker on a given key in order to preserve
table semantics. Piccolo uses a multi-phase migration
process that does not require suspending any active
tasks.

The master coordinates the process of migrating partition
i from $w_a$ to $w_b$, which proceeds in two phases. In the
first phase, the master sends message $M_1$ to all workers
indicating the new ownership of i. Upon receiving $M_1$,
all workers flush their buffered operations for i to $w_a$ and
begin to send subsequent requests for i to $w_b$. Upon the
receipt of $M_1$, $w_a$ “pauses” updates to i, and begins to
forward requests received from other workers for i to $w_b$.
$w_a$ then transfers the paused state for i to $w_b$. During this
phase, worker $w_b$ buffers all requests for i received from
$w_a$ or other workers but does not yet handle them.

After the master has received acknowledgments from
all workers that the first phase is complete, it sends $M_2$ to
$w_a$ and $w_b$ to complete migration. Upon receiving $M_2$, $w_a$
flushes any pending operations destined for i to $w_b$ and
discards the paused state for partition i. $w_b$ first handles
buffered operations received from $w_a$ in order and then
resumes normal operation on partition i.
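
The receiving side of this migration can be sketched as a small state machine. The sketch is illustrative only: the class `NewOwner` and its method names are hypothetical, and it assumes that everything the old owner forwards has already been delivered when the second-phase message is handled.

```python
class NewOwner:
    """Models the worker receiving partition i (hypothetical sketch)."""

    def __init__(self, old_owner, accumulate):
        self.old_owner = old_owner    # name of the previous owner
        self.accumulate = accumulate  # user-defined accumulator for updates
        self.table = {}
        self.migrating = True         # True between the two phase messages
        self.from_old_owner = []      # requests forwarded by the previous owner
        self.from_others = []         # requests sent directly by other workers

    def install_state(self, paused_state):
        # Phase 1: the previous owner ships the paused state of partition i.
        self.table = dict(paused_state)

    def receive(self, sender, key, value):
        if not self.migrating:
            self._apply(key, value)
        elif sender == self.old_owner:
            self.from_old_owner.append((key, value))
        else:
            self.from_others.append((key, value))

    def on_phase_two(self):
        # Phase 2: drain what the previous owner forwarded first, in order,
        # then the directly received requests, then resume normal operation.
        for key, value in self.from_old_owner + self.from_others:
            self._apply(key, value)
        self.from_old_owner, self.from_others = [], []
        self.migrating = False

    def _apply(self, key, value):
        self.table[key] = (self.accumulate(self.table[key], value)
                           if key in self.table else value)
```

Handling the forwarded requests first keeps each sender's operations in their original order: anything a worker sent to the old owner was issued before that worker switched to the new owner.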

As can be seen, the migration process does not block 
any update operations and thus incurs little latency over- 
head for most kernels. The normal checkpoint/recovery 
mechanism is used to cope with faults that might occur 
during migration. 


3.3 Fault Tolerance 


Piccolo relies on user-assisted checkpoint and restore to 
cope with both master and worker failures during pro- 
gram execution. The Piccolo run-time saves a checkpoint 
of program state (including tables and other user-data) on 
a distributed file system and restores from the latest com- 
pleted checkpoint to recover from a failure. 

Checkpoint: Piccolo needs to save a consistent global 
checkpoint with low overhead. To ensure consistency, 
Piccolo must determine a global snapshot of the program 
state. To reduce overhead, the run-time must carry out 
checkpointing in the face of actively running kernel in- 
stances or the control thread. 

We use the Chandy-Lamport (CL) distributed snap- 
shot algorithm [15] to perform checkpointing. To save a 
CL snapshot, each process records its own state and two 
processes incident on a communication channel cooper- 
ate to save the channel state. In Piccolo, channel state 
can be efficiently captured using only table modification 
messages as kernels communicate with each other exclu- 
sively via tables. 

To begin a checkpoint, the master chooses a new
checkpoint epoch number ($E$) and sends the start checkpoint
message $Start_E$ to all workers. Upon receiving the
start message, worker $w$ immediately takes a snapshot
of the current state of its responsible table partitions and
buffers future table operations (in addition to applying
them). Once the table partitions in the snapshot are written
to stable storage, $w$ sends the marker message $M_{E,w}$
to all other workers. Worker $w$ then enters a logging
state in which it logs all buffered operations to a replay
file. Once $w$ has received markers from all other workers
($M_{E,w'}, \forall w' \neq w$), it writes the replay log to stable storage
and sends $Fin_{E,w}$ to the master. The master considers the
checkpointing done once it has received $Fin_{E,w}$ from all
workers.
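
A worker's side of this exchange might look like the sketch below. It is not Piccolo's implementation: the class and method names are hypothetical, the snapshot is kept in memory instead of stable storage, updates are assumed to use a sum accumulator, and the rule of logging a sender's operations only until that sender's marker arrives follows the standard Chandy-Lamport channel-state definition rather than anything stated explicitly in the text.

```python
import copy

class CheckpointingWorker:
    """One worker's view of a single checkpoint epoch (hypothetical sketch)."""

    def __init__(self, name, peers, table):
        self.name = name
        self.peers = set(peers)    # names of all other workers
        self.table = table         # key -> value for the partitions owned here
        self.snapshot = None
        self.replay_log = []       # channel state: operations after the snapshot
        self.markers_seen = set()
        self.logging = False
        self.outbox = []           # (destination, message); stands in for the network

    def on_start(self, epoch):
        # Snapshot local table state, then send a marker to every other worker.
        self.snapshot = copy.deepcopy(self.table)
        self.logging = True
        for peer in sorted(self.peers):
            self.outbox.append((peer, ("marker", epoch, self.name)))
        self._maybe_finish(epoch)

    def on_update(self, sender, key, value):
        # Operations keep being applied while the checkpoint is in progress.
        self.table[key] = self.table.get(key, 0) + value
        # Log only what arrives before the sender's marker.
        if self.logging and sender not in self.markers_seen:
            self.replay_log.append((key, value))

    def on_marker(self, epoch, sender):
        self.markers_seen.add(sender)
        self._maybe_finish(epoch)

    def _maybe_finish(self, epoch):
        # With markers from all peers, the replay log is complete.
        if self.logging and self.markers_seen == self.peers:
            self.logging = False
            self.outbox.append(("master", ("fin", epoch, self.name)))
```

Restoring from such a checkpoint would load the snapshot and then re-apply the logged operations in order.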

For asynchronous checkpoints, the master initiates
checkpoints periodically based on a timer. To record
user-data consistently with recorded table state, each
worker atomically takes a snapshot of table state and invokes
the checkpoint callback function to save any additional
user state for its currently running kernel instance.
Synchronous checkpoints provide the semantics
that the checkpointed state is equivalent to the state immediately
after the global barrier. Therefore, for synchronous
checkpointing, each worker waits until it has completed
all its assigned tasks before sending the checkpoint
marker $M_{E,w}$ to all other workers. Furthermore, the master
saves user-data in the control thread only after it has
received $Fin_{E,w}$ from all workers. There is a trade-off in
deciding when to start a synchronous checkpoint. If the
master starts the checkpoint too early, e.g. while workers
still have many remaining tasks, replay files become unnecessarily
large. On the other hand, if the master delays
checkpointing until all workers have finished, it misses
opportunities to overlap kernel computation with checkpointing.
Piccolo uses a heuristic to balance this trade-off:
the master begins a synchronous checkpoint as soon
as one of the workers has finished all its assigned tasks.

To simplify the design, the master does not initiate 
checkpointing while there is active table migration and 
vice-versa. 

Restore: Upon detecting any worker failure, the master
resets the state of all workers and restores computation
from the last completed global checkpoint. Piccolo
does not checkpoint the internal state of the master.
If the master is restarted, restoration occurs as normal;
however, the replacement master is free to choose
a different partition assignment and task schedule during
restoration.


4 More Applications 


In addition to PageRank, we have implemented four
other applications: a distributed web crawler, k-means,
n-body, and matrix multiplication. This section summarizes
how Piccolo’s programming model enables efficient im- 
plementation for these applications. 


4.1 Distributed Web Crawler 


Apart from iterative computations such as PageRank,
Piccolo can be used by applications to distribute and coordinate
fine-grained tasks among many machines. To
demonstrate this usage, we implemented a distributed
web crawler. The basic crawler operation is simple: beginning
from a few initial URLs, the crawler repeatedly
downloads a page and parses it to discover new URLs
to fetch. A practical crawler must also satisfy other important
constraints: (1) honor the robots.txt file of each
web site, (2) refrain from overwhelming a site by capping
fetches to a site at a fixed rate, and (3) avoid repeated
fetches of the same URL.


# local variables kept by each kernel instance
fetch_pool = Queue()
crawl_output = OutputLog('./crawl.data')

def FetcherThread():
    while 1:
        url = fetch_pool.get()
        txt = download_url(url)
        crawl_output.add(url, txt)
        # mark every extracted link as a candidate for fetching
        for link in get_links(txt):
            url_table.update(link, ShouldFetch)
        url_table.update(url, Done)

def CrawlKernel(Table(URL,CrawlState) url_table):
    # start a pool of 20 helper threads
    for i in range(20):
        t = FetcherThread()
        t.start()
    while 1:
        for url, status in url_table.my_partition:
            if status == ShouldFetch:
                # omit checking domain in robots table
                # omit checking domain in politeness table
                url_table.update(url, Fetching)
                fetch_pool.add(url)


Figure 4: Snippet of the crawler implementation.

Our implementation uses three co-located tables: 


• The url_table stores the crawling state (ToFetch,
Fetching, Blacklisted, or Done) for each URL. For each
URL p in ToFetch state, the crawler fetches the cor- 
responding web page and sets p’s state to Fetching. 
After the crawler has finished parsing p and extract- 
ing its outgoing links, it sets p’s state to Done. 


• The politeness table tracks the last time a page was
downloaded for each site. 


• The robots table stores the processed robots file for
each site. 


The crawler spawns m kernel instances, one for each 
machine. Our implementation is done in Python in order 
to utilize Python’s web-related libraries. Figure 4 shows 
the simplified crawler kernel (omitting details for pro- 
cessing robots.txt and capping per-site download rate). 
Each kernel scans its local url_table partitions to find
ToFetch URLs and processes them using a pool of helper
threads. As all three tables are partitioned according to
the SitePartitioner function and co-located with each
other, a kernel instance can efficiently check for the
politeness information and robots entries before downloading
a URL. Our implementation uses the max accumulator
to resolve write-write conflicts on the same
URL in url_table according to Done > Blacklisted >
Fetching > ToFetch. This allows the simple and elegant
operation shown in Figure 4, where kernels re-discovering
an already-fetched URL p can request updating
p’s state to ToFetch and still arrive at the correct
state for p.
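
The effect of the max accumulator on crawl states can be shown with a few lines of Python. The enum values and the dictionary standing in for url_table are assumptions made for this example; only the ordering Done > Blacklisted > Fetching > ToFetch comes from the text.

```python
from enum import IntEnum

class CrawlState(IntEnum):
    # Numeric values are chosen only to encode the ordering from the text.
    TO_FETCH = 0
    FETCHING = 1
    BLACKLISTED = 2
    DONE = 3

def max_accumulator(current, update):
    """Resolve a write-write conflict by keeping the larger state."""
    return max(current, update)

url_table = {}  # stand-in for the distributed table: URL -> CrawlState

def update(url, state):
    if url in url_table:
        url_table[url] = max_accumulator(url_table[url], state)
    else:
        url_table[url] = state
```

For instance, after `update("p", CrawlState.DONE)`, a later `update("p", CrawlState.TO_FETCH)` from a kernel that re-discovers p leaves the entry at `CrawlState.DONE`, so the page is not fetched again.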

Consistent global checkpointing is important for the 
crawler’s recovery. Without global checkpointing, the re- 
covered crawler may find a page p to be Done but does 
not see any of p’s extracted links in the url_table, pos- 
sibly causing those URLs to never be crawled. Our im- 
plementation performs asynchronous checkpointing ev- 
ery 10 minutes so that the crawler loses no more than 10 
minutes worth of progress due to node failure. Restoring 
from the last checkpoint can result in some pages being 
crawled more than once (those lost since the last check- 
point), but the checkpoint mechanism guarantees that no 
pages will “fall through the cracks.” 


4.2 Parallel computation 


k-means. The k-means algorithm is an iterative com- 
putation for grouping n data points into k clusters in a 
multi-dimensional space. Our implementation stores the 
assigned centers for data points and the positions of cen- 
ters in shared tables. Each kernel instance processes a 
subset of data points to compute new center assignments 
for those data points and update center positions for the 
next iteration using the summation accumulator. 

n-body. This application simulates the dynamics of a 
set of particles over many discrete time-steps. We im- 
plemented an n-body simulation intended for short dis- 
tances [43], where particles further than a threshold dis- 
tance (r) apart are assumed to have no effect on each 
other. During each time-step, a kernel instance processes 
a subset of particles: it updates a particle’s velocity and 
position based on its current velocity and the positions of 
other particles within r distance away. Our implementa- 
tion uses a partition function to divide space into cubes 
so that a kernel instance mostly performs local reads in 
order to retrieve those particles within r distance away. 
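
A cube-based lookup of this kind can be sketched as follows. The helper names and the `particles_by_cube` dictionary are hypothetical, and the sketch assumes a cube edge length of at least r, so that every particle within distance r lies in the same cube or one of its 26 neighbors.

```python
import math
from itertools import product

def cube_of(position, edge):
    """Index of the cube that contains a 3-D position."""
    return tuple(math.floor(c / edge) for c in position)

def neighborhood(cube):
    """The cube itself plus its 26 adjacent cubes."""
    return [tuple(c + d for c, d in zip(cube, delta))
            for delta in product((-1, 0, 1), repeat=3)]

def particles_within(position, r, edge, particles_by_cube):
    """Particles within distance r of position, scanning adjacent cubes only.

    Only correct when edge >= r (assumption of this sketch).
    """
    found = []
    for cube in neighborhood(cube_of(position, edge)):
        for other in particles_by_cube.get(cube, []):
            if other != position and math.dist(position, other) <= r:
                found.append(other)
    return found
```

If each table partition holds a contiguous group of cubes, most of the 27 cubes scanned for a particle fall in the kernel instance's own partition, which is why the reads are mostly local.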

Matrix multiplication. Computing C = AB where A
and B are two large matrices is a common primitive in
numerical linear algebra. The input and output matrices
are divided into m × m blocks stored in three tables.
Our implementation co-locates tables A, B, C. Each kernel
instance processes a partition of table C by computing
$C_{i,j} = \sum_{k=1}^{m} A_{i,k} B_{k,j}$.
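
The blocked product can be written out in plain Python as a sketch. The dictionaries keyed by block index stand in for the three tables, and the function names are hypothetical; a real kernel instance would compute only the blocks of C in its own partition.

```python
def matmul(X, Y):
    """Product of two dense blocks given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def add(X, Y):
    """Element-wise sum of two blocks of equal shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def block_multiply(A, B, m):
    """C[i, j] = sum over k of A[i, k] * B[k, j], with m x m blocks.

    A and B map a block index (i, j) to a square block.
    """
    C = {}
    for i in range(m):
        for j in range(m):
            acc = None
            for k in range(m):
                prod = matmul(A[i, k], B[k, j])
                acc = prod if acc is None else add(acc, prod)
            C[i, j] = acc
    return C
```

Each output block needs one full block-row of A and one block-column of B, which is the data a kernel instance must read from the tables.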


5 Implementation 


Piccolo has been implemented in C++. We provide both 
C++ and Python APIs so that users can write kernel 
and control functions in either C++ or Python. We use 
SWIG [6] for constructing a Python interface to Piccolo.
Our implementation re-uses a number of existing
libraries, such as OpenMPI for communication, Google’s
protocol buffers for object serialization, and LZO for 
compressing on-disk tables. 

All the parallel computations (PageRank, k-means, n- 
body and matrix multiplication) are implemented using 
the C++ Piccolo API. The distributed crawler is imple- 
mented using the Python API. 


6 Evaluation 


We tested the performance of Piccolo on the applica- 
tions described in Section 4. Some applications, such as 
PageRank and k-means, can also be implemented using 
the existing data-flow model and we compared the per- 
formance of Piccolo with that of Hadoop for these appli- 
cations. 

The highlights of our results are: 


• Piccolo is fast. PageRank and k-means are 11x and
4x faster, respectively, than those on Hadoop. When compared
against the results published for DryadLINQ [53], in
which a PageRank iteration on a 900M page graph
was performed in 69 seconds, Piccolo finishes an
iteration for a 1B page graph in 70 seconds on EC2,
while using 1/5 the number of CPU cores.


• Piccolo scales well. For all applications evaluated,
increasing the number of workers shows a nearly 
linear reduction in the computation time. Our 100- 
instance EC2 experiment on PageRank also demon- 
strates good scaling. 


• Piccolo can help a non-conventional application like
the crawler to achieve good parallel performance. 
Our crawler, despite being implemented in Python, 
manages to saturate the Internet bandwidth of our 
cluster. 


6.1 Test Setup 


Most experiments were performed using our local cluster
of 12 machines: 6 of the machines have 1 quad-core
Intel Xeon X3360 (2.83GHz) processor with 4GB
memory, the other 6 machines have 2 quad-core Xeon
E5520 (2.27GHz) processors with 8GB memory. All
machines are connected via a commodity gigabit ether- 
net switch. Our EC2 experiments involve 100 “large in- 
stances” each with 7.5GB memory and 2 “virtual cores” 
where each virtual core is equivalent to a 2007-era single 
core 2.5GHz Intel Xeon processor. In all experiments, 
we created one worker process per core and pinned each 
worker to use that core. 

For scaling experiments, we vary the input size of different
applications. Table 5 shows the default and maximum
input size used for each application. We generate
the web link graph for PageRank based on the statistics
of a web graph of 100M pages in the UK [9]. Specifically, we
extract the distributions for the number of pages in each
site and the ratio of intra/inter-site links. We generate a
web graph of any size by sampling from the site size distribution
until the desired number of pages is reached;
outgoing links are then generated for each page in a site
based on the distribution of the ratio of intra/inter-site
links. For other applications, we use randomly generated
inputs.


Figure 6: Scaling performance (fixed default input size)
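
The graph generation procedure described in this section could be sketched roughly as below. Several simplifications are assumptions of the example, not the paper's method: the function name is hypothetical, site sizes are drawn uniformly from a list of observed sizes, a single `intra_ratio` replaces the per-site distribution of the intra/inter-site ratio, every page gets the same `out_degree`, and an inter-site target is drawn from the whole graph, so it may occasionally land in the same site.

```python
import random

def generate_web_graph(num_pages, site_sizes, intra_ratio, out_degree, rng):
    """Return (sites, links) for a synthetic web graph.

    site_sizes must contain positive integers only.
    """
    # Sample site sizes until the desired number of pages is reached.
    sites, next_page = [], 0
    while next_page < num_pages:
        size = min(rng.choice(site_sizes), num_pages - next_page)
        sites.append(range(next_page, next_page + size))
        next_page += size

    # Generate outgoing links for each page in each site.
    links = {}
    for site in sites:
        for page in site:
            targets = []
            for _ in range(out_degree):
                if rng.random() < intra_ratio:
                    targets.append(rng.choice(site))           # intra-site link
                else:
                    targets.append(rng.randrange(num_pages))   # any page
            links[page] = targets
    return sites, links
```

Passing a seeded `random.Random` instance as `rng` makes the generated graph reproducible across runs.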


6.2 Scaling Performance 


Figure 6 shows application speedup as the number of
workers (N) increases from 8 to 64 for the default input
size. All applications are CPU-bound and exhibit good 
speedup with increasing N. Ideally, all applications (ex- 
cept for PageRank) have perfectly balanced table par- 
titions and should achieve linear speedup. However, to 
have reasonable running time at N=8, we choose a rela- 
tively small default input size. Thus, as N increases to 
64, Piccolo’s overhead is no longer negligible relative 
to applications’ own computation (e.g. k-means finishes 
each iteration in 1.4 seconds at N=64), resulting in 20% 
less than ideal speedup. PageRank’s table partitions are 
not balanced and work stealing becomes important for its 
scaling (see § 6.5). 

We also evaluate how applications scale with increasing
input size by adjusting input size to keep the amount
of computation per worker fixed with increasing N. We
scale the input size linearly with N for PageRank and k-means.
For matrix multiplication, the edge size increases
as $O(N^{1/3})$. We do not show results for n-body because it
is difficult to scale input size to ensure a fixed amount of
computation per worker. For these experiments, the ideal
scaling has constant running time as input size increases
with N. As Figure 7 shows, the achieved scaling for all
applications is within 20% of the ideal number.


Figure 8: Scaling input size on EC2.
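
The cube-root growth of the matrix edge size follows from a short calculation, given here as a sketch. It assumes the naive algorithm, in which multiplying two n × n matrices costs on the order of n³ operations; n₀ and N₀ are hypothetical names for a baseline edge size and worker count.

```latex
% Assumption: multiplying two n x n matrices costs \Theta(n^3) operations.
% Keeping the work per worker constant as N grows gives:
\[
\frac{n^3}{N} = \frac{n_0^3}{N_0}
\quad\Longrightarrow\quad
n = n_0 \left(\frac{N}{N_0}\right)^{1/3}
\]
```

For example, going from 8 to 64 workers multiplies N by 8, so the edge size only doubles.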


6.3 EC2 


We investigated how Piccolo scales with a larger number 
of machines using 100 EC2 instances. Figure 8 shows 
the scaling of PageRank and k-means on EC2 as we in- 
crease their input size with N. We were somewhat sur- 
prised to see that the resulting scaling on EC2 is bet- 
ter than achieved on our small local testbed. Our local 
testbed’s CPU performance exhibited quite some vari- 
ability, impacting scaling. After further investigation, we 
believe the source for such variability is likely due to dy- 
namic CPU frequency scaling. 

At N=200, PageRank finishes in 70 seconds for a 1B 
page link graph. On a similar sized graph (900M pages), 
our local testbed achieves comparable performance (~80
seconds) with many fewer workers (N=64), due to the
higher performing cores on our local testbed. 


6.4 Comparison with Other Frameworks 


Comparison with Hadoop: We implemented PageRank
and k-means in Hadoop to compare their performance
against that of Piccolo. The rest of our applications, including
the distributed web crawler, n-body and matrix
multiplication, do not have any straightforward implementation
with Hadoop’s data-flow model.

For the Hadoop implementation of PageRank, as with 
Piccolo, we partition the input link graph by site. Dur- 
ing execution, each map task has locality with the parti- 
tion of graph it is operating on. Mappers join the graph 
and PageRank score inputs, and use a combiner to aggre- 
gate partial results. Our Hadoop k-means implementation 
is highly optimized. Each mapper fetches all 100 cen- 
troids from the previous iteration via Hadoop File Sys- 
tem (HDFS), computes the cluster assignment of each 
point in its input stream, and uses a local hash map to ag- 
gregate the updates for each cluster. As a result, a reducer 
only needs to aggregate one update from each mapper to 
generate the new centroid. 

We made extensive efforts to optimize the perfor- 
mance of PageRank and k-means on Hadoop including 
changes to Hadoop itself. Our optimizations include us- 
ing raw memory comparisons, using primitive types to 
avoid Java’s boxing and unboxing overhead, disabling 
checksumming, improving Hadoop’s join implementa- 
tion etc. Figure 9 shows the running time of Piccolo 
and Hadoop using the default input size. Piccolo signif- 
icantly outperforms Hadoop on both benchmarks (11x 
for PageRank and 4x for k-means with N=64). The 
performance difference between Hadoop and Piccolo is 
smaller for k-means because of our optimized k-means 
implementation; the structure of PageRank does not ad- 
mit a similar optimization. 

Although we expected to see some performance dif- 
ference because Hadoop is implemented in Java while 
Piccolo in C++, the order of magnitude difference came 
as a surprise. We profiled the PageRank implementation 
on Hadoop to find the contributing factors. The leading 
causes for the slowdown are: (1) sorting keys in the map 
phase (2) serializing and de-serializing data streams and 
(3) reading and writing to HDFS. Key sorting alone ac- 
counted for nearly 50% of the runtime in the PageR- 
ank benchmark, and serialization another 15%. In con- 
trast, with Piccolo, the need for (1) is eliminated and 
the overhead associated with (2) and (3) is greatly re- 
duced. PageRank rank values are stored in memory and 
are available across iterations without being serialized to 
a distributed file system. In addition, as most outgoing 
links point to other pages at the same site, a kernel in- 
stance ends up performing most updates directly to lo- 
cally stored table data, thereby avoiding serialization for 
those updates entirely. 

Comparison with MPI: We compared the performance
of matrix multiplication using Piccolo to a third-party
MPI-based implementation [2]. The MPI version
uses Cannon’s algorithm for blocked matrix multiplication
and uses MPI-specific communication primitives to
handle data broadcast and the simultaneous sending and
receiving of data. For Piccolo, we implemented the naive
blocked multiplication algorithm, using our distributed
tables to handle the communication of matrix state. As
Piccolo relies on MPI primitives for communication, we
do not expect to see a performance advantage, but are
more interested in quantifying the amount of overhead
incurred.


Figure 9: Per-iteration running time of PageRank and k-means
in Hadoop and Piccolo (fixed default input size).

Figure 10: Runtime of matrix multiply, scaled relative to MPI.

Figure 10 shows that the running time of the Piccolo
implementation is at most 10% longer than that of the MPI
implementation. We were surprised to see that our Piccolo
implementation out-performed the MPI version in experiments
with more workers. Upon inspection, we found
that this was due to slight performance differences be- 
tween machines in our cluster; as the MPI implementa- 
tion has many more synchronization points than that of 
Piccolo, it is forced to wait for slower nodes to catch up. 


6.5 Work Stealing and Slow Machines 


The PageRank benchmark provides a good basis for testing
the effect of work stealing because the web graph partitions
have highly variable sizes: the largest partition for
the 900M-page graph is 5 times the size of the smallest.
Using the same benchmark, we also tested how performance
changed when one worker was operating slower
than the rest. To do so, we ran a CPU-intensive program
on one core that resulted in the worker bound to that core
having only 50% of the CPU time of the other workers.


Figure 11: Effect of Work Stealing and Slow Workers

The results of these tests are shown in Figure 11. Work 
stealing improves running time by 10% when all ma- 
chines are operating normally. The improvement is due 
to the imbalance in the input partition sizes - when run 
without work stealing, the computation waits longer for 
the workers processing more data to catch up. 

The effect of slow workers on the computation is more 
dramatic. With work-stealing disabled, the runtime is 
nearly double that of the normal computation, as each 
iteration must wait for the slowest worker to complete 
all assigned tasks. Enabling work stealing improves the 
situation dramatically - the computation time is reduced 
to less than 5% over that of the non-slow case.


6.6 Checkpointing 


We evaluated the checkpointing overhead using the 
PageRank, k-means and n-body problems. Compared to 
the other problems, PageRank has a larger table that 
needs to be checkpointed, making it a more demand- 
ing test of checkpoint/restore performance. In our ex- 
periment, each worker wrote its checkpointed table par- 
titions to the local disk. Figure 12 shows the runtime 
when checkpointing is enabled relative to when there 
is no checkpointing. For the naive synchronous check- 
pointing strategy, the master starts checkpointing only 
after all workers have finished. For the optimized strat- 
egy, the master initiates the checkpoint as soon as one of 
the workers has finished. As the figure shows, overhead 
of the optimized checkpointing strategy is quite negligi- 
ble (~2%) and the optimization of starting checkpointing 
early results in significant reduction of overhead for the 
larger PageRank checkpoint. 

Limitations of global checkpoint and restore: The
global nature of Piccolo’s failure recovery mechanism
raises the question of scalability. As the size of a cluster increases,
failure becomes more frequent; this causes more
frequent checkpointing and restoration which consume a
larger fraction of the overall computation time. While we
lacked the machine resources to directly test the performance
of Piccolo on thousands of machines, we estimate
the scalability limit of Piccolo’s checkpointing mechanism
based on expected machine uptime.


Figure 12: Checkpoint overhead. Per-iteration runtime is scaled
relative to without checkpointing.


[Plot: x-axis Machines (0 to 8000); one curve per machine MTBF, including 1 year and 3 weeks.]

Figure 13: Expected scaling for large clusters.

We consider a hypothetical cluster of machines with 
16GB of RAM and 4 disk drives. We measured the time 
taken to checkpoint and restore such a machine in the 
“worst case” - a computation whose table state uses all 
available system memory. We estimate the fraction of 
time a Piccolo computation would spend working pro- 
ductively (not in a checkpoint or restore state), for vary- 
ing numbers of machines and failure rates. In our model, 
we assume that machine failures arrive at a constant in- 
terval defined by the failure rate and the number of ma- 
chines in a cluster. While this is a simplification of real- 
life failure behavior, it is a worst-case scenario for the 
restore mechanism, and as such provides a useful lower 
bound. The expected efficiency based on our model is 
shown in Figure 13. For well maintained data-centers 
that we are familiar with, the average machine uptime is 
typically around 1 year. For these data-centers, the global
checkpointing mechanism can efficiently scale up to a 
few thousand machines. 


6.7 Distributed Crawler 


We evaluated our distributed crawler implementation using
various numbers of workers. The URL table was initialized
with a seed set of 1000 URLs. At the end of a 30-minute
run of the experiment, we measured the number
of pages crawled and bytes downloaded. Figure 14
shows the crawler’s web page download throughput in
MBytes/sec as N increases from 1 to 64. The crawler
spends most CPU time in the Python code for parsing
HTML and URLs. Therefore, its throughput scales
approximately linearly with N. At N=32, the crawler
download throughput peaks at ~10MB/s which is limited
by our 100-Mbps Internet uplink. There are highly optimized
single-server crawler implementations that can
sustain higher download rates than 100Mbps [49]. However,
our Piccolo-based crawler could potentially scale
to even higher download rates despite being built using
Python.


Figure 14: Crawler throughput


7 Related Work 


Communication-oriented models: Communication- 
based primitives such as MPI [21] and Parallel Virtual 
Machine (PVM [46]) have been popular for construct- 
ing distributed programs for many years. MPI and PVM 
offer extensive messaging mechanisms including unicast 
and broadcast as well as support for creating and manag- 
ing remote processes in a distributed environment. There 
has been continuous research on developing experimen- 
tal features for MPI, such as optimization of collective 
operations [3], fault-tolerance via machine virtualiza- 
tion [34] and the use of hybrid checkpoint and logging 
for recovery [10]. MPI has been used to build very high 
performance applications - its support of explicit com- 
munication allows considerable flexibility in writing ap- 
plications to take advantage of a wide variety of network 
topologies in supercomputing environments. This flexi- 
bility has a cost in the form of complexity - users must 
explicitly manage communication and synchronization 
of state between workers, which can become difficult to 
do while attempting to retain efficient and correct execu- 
tion. 

BSP (Bulk Synchronous Parallel) is a high-level
communication-oriented model [50]. In this model,
threads execute on different processors with local memory,
communicate with each other using messages, and
perform global-barrier synchronization. BSP implementations
are typically realized using MPI [25]. Recently,
the BSP model has been adopted in the Pregel framework
for parallelizing work on large graphs [33].

Distributed shared-memory: The complexity of pro- 
gramming for communication-oriented models drove a 
wave of research in the area of distributed shared mem- 
ory (DSM) systems [30, 29, 32, 7]. Most DSM systems 
aim to provide transparent memory access, which causes 
programs written for DSMs to incur many fine-grained 
synchronization events and remote memory reads. While 
initially promising, DSM research has fallen off as the 
ratio of network latency to local CPU performance has 
widened, making naive remote accesses and synchro- 
nization prohibitively expensive. 

Partitioned Global Address Space (PGAS) [17, 35, 51]
is a set of language extensions to realize a distributed
shared address space. These extensions try to ameliorate 
the latency problems of DSM by allowing users to ex- 
press affinities of portions of shared memory with a par- 
ticular thread, thereby reducing the frequency of remote 
memory references. They retain the low level (flat mem- 
ory) interface common to DSM. As a result, applica- 
tions written for PGAS systems still require fine-grained 
synchronization when operating on non-primitive data- 
types, or in order to aggregate several values (for in- 
stance, computing the sum of a memory location with 
multiple writers). 

Tuple spaces, as seen in coordination languages such
as Linda [13] and more recently JavaSpaces [22], expose
to users a global tuple-space accessible from all partic- 
ipating threads. Although tuple spaces provide atomic 
primitives for reading and writing tuples, they are not in- 
tended for high-frequency access. As such, there is no 
support for locality optimization nor write-write conflict 
resolution. 
MapReduce and Dataflow models: In recent years, 
MapReduce has emerged as a popular programming 
model for parallel data processing [19]. Many recent
efforts inspired by MapReduce range from generalizing
MapReduce to support the join operation [27] and
improving MapReduce’s pipelining performance [16] to
building high-level languages on top of MapReduce (e.g.,
DryadLINQ [53], Hive [48], Pig [37], and Sawzall [40]).
FlumeJava [14] provides a set of collection abstractions 
and parallel execution primitives which are optimized 
and compiled down to a sequence of MapReduce opera- 
tions. 

The programming models of MapReduce [19] and 
Dryad [27] are instances of stream processing, or 
data-flow models. Because of MapReduce’s popularity,
programmers have started using it to build in-memory
iterative applications such as PageRank, even though the
data-flow model is not a natural fit for these applications.
Spark [54] proposes adding a distributed read-only
in-memory cache to improve the performance of
MapReduce-based iterative computations.

Single-machine shared memory models: Many pro- 
gramming models are available for parallelizing exe- 
cution on a single machine. In this setting, there ex- 
ists a physically-shared memory among computing cores 
supporting low-latency memory access and fast syn- 
chronization between threads of computation, which are 
not available in a distributed environment. Although 
there are also popular streaming/data-flow models [44, 
47, 12], most parallel models for a single machine 
are based on shared-memory. For the GPU platform, 
there are CUDA [36] and OpenCL [24]. For multi-core
CPUs, Cilk [8] and, more recently, Intel’s Threading
Building Blocks [41] provide support for low-overhead
thread creation and fine-grained dispatching of tasks.
OpenMP [18] is a popular shared-memory model among 
the scientific computing community: it allows users 
to target sections of code for parallel execution and 
provides synchronization and reduction primitives. Recently,
there have been efforts to support OpenMP programs across a
cluster of machines [26, 5]. However, because they are built on
software distributed shared memory, the resulting implementations
suffer from the same limitations as DSMs and PGAS systems.

Distributed data structures: The goal of distributed 
data structures is to provide a flexible and scalable data 
storage or caching interface. Examples of these include 
DDS [23], Memcached [39], the recently proposed Ram- 
Cloud [38], and many key-value stores based on dis- 
tributed hash tables [4, 20, 45, 42]. These systems do 
not seek to provide a computation model, but rather are 
targeted towards loosely-coupled distributed applications 
such as web serving. 


8 Conclusion 


Parallel in-memory applications need to access and share
intermediate state that resides on different machines.
Piccolo provides a programming model that supports 
the sharing of mutable, distributed in-memory state via 
a key/value table interface. Piccolo helps applications 
achieve high performance by optimizing for locality of 
access to shared state and having the run-time auto- 
matically resolve write-write conflicts using application- 
specified accumulation functions. 
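
As a rough illustration of the last point, the sketch below shows how an application-specified accumulation function can resolve concurrent writes to one key. The `AccumTable` class and its `update` and `get` methods are hypothetical names chosen for this example, not Piccolo's actual API, and the table lives in a single process rather than being partitioned across machines.

```python
from typing import Callable, Dict, Generic, TypeVar

K = TypeVar("K")
V = TypeVar("V")


class AccumTable(Generic[K, V]):
    """Single-process stand-in for a shared key/value table (hypothetical)."""

    def __init__(self, accumulate: Callable[[V, V], V]) -> None:
        self._accumulate = accumulate
        self._data: Dict[K, V] = {}

    def update(self, key: K, value: V) -> None:
        # Merge with the existing value instead of overwriting it, so that
        # writes from several workers never conflict.
        if key in self._data:
            self._data[key] = self._accumulate(self._data[key], value)
        else:
            self._data[key] = value

    def get(self, key: K) -> V:
        return self._data[key]


# Three workers contribute partial values for the same key; summation is
# the accumulation function.
ranks = AccumTable(lambda a, b: a + b)
for contribution in (0.25, 0.5, 0.125):
    ranks.update("page-7", contribution)
```

Because the accumulation function here is commutative and associative, the final value does not depend on the order in which the contributions arrive.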
Abstract


The paper describes the design, implementation, and 
evaluation of Depot, a cloud storage system that mini- 
mizes trust assumptions. Depot tolerates buggy or mali- 
cious behavior by any number of clients or servers, yet it 
provides safety and liveness guarantees to correct clients. 
Depot provides these guarantees using a two-layer archi- 
tecture. First, Depot ensures that the updates observed by 
correct nodes are consistently ordered under Fork-Join- 
Causal consistency (FJC). FJC is a slight weakening of 
causal consistency that can be both safe and live despite 
faulty nodes. Second, Depot implements protocols that 
use this consistent ordering of updates to provide other 
desirable consistency, staleness, durability, and recovery 
properties. Our evaluation suggests that the costs of these 
guarantees are modest and that Depot can tolerate faults 
and maintain good availability, latency, overhead, and 
staleness even when significant faults occur. 


1 Introduction 


This paper describes the design, implementation, and 
evaluation of Depot, a cloud storage system in the spirit 
of S3 [1], Azure [4], and Google Storage [3] but with a 
crucial difference: Depot clients do not have to trust, that 
is assume, that Depot servers operate correctly. 

What motivates Depot is that cloud storage service 
providers (SSPs), such as S3 and Azure, are fault-prone
black boxes operated by a party other than the data 
owner. Indeed, clouds can experience software bugs [9], 
correlated manufacturing defects [57], misconfigured 
servers and operator error [53], malicious insiders [68], 
bankruptcy [5], undiagnosed problems [14], Acts of God 
(e.g., fires [20]) and Man [50]. Thus, it seems prudent 
for clients to avoid strong assumptions about an SSP’s 
design, implementation, operation, and status—and in- 
stead to rely on end-to-end checks of well-defined prop- 
erties. In fact, removing such assumptions promises to 
help SSPs too: today, a significant barrier to adopting 
cloud services is precisely that many organizations hesi- 
tate to place trust in the cloud [18]. 

Given this motivation, Depot assumes less than any 
prior system about the correctness of participating hosts: 


• Depot eliminates trust for safety. A client can ensure
safety by assuming the correctness of only itself. De- 
pot guarantees that any subset of correct clients ob- 
serves sensible, well-defined semantics. This holds 
regardless of how many nodes fail and no matter
whether they are clients or servers, whether these
are failures of omission or commission, and whether 
these failures are accidental or malicious. 


• Depot minimizes trust for liveness and availability.
We wish we could say “trust only yourself” for live- 
ness and availability. Depot does eliminate trust for 
updates: a client can always update any object for 
which it is authorized, and any subset of connected, 
correct clients can always share updates. However, for 
reads, there is a fundamental limit to what any storage 
system can guarantee: if no correct, reachable node 
has an object, that object may be unavailable. We cope 
with this fundamental limit by allowing reads to be 
served by any node (even other clients) while preserv- 
ing the system’s guarantees, and by configuring the 
replication policy to use several servers (which pro- 
tects against failures of clients and subsets of servers) 
and at least one client (which protects against tempo- 
rary [8] and permanent [5, 14] cloud failures). 


Though prior work has reduced trust assumptions in 
storage systems, it has not minimized trust with respect 
to safety, liveness, or both. For example, quorum and 
replicated state machine approaches [15, 19, 30] toler- 
ate failures by a fraction of servers. However, they sac- 
rifice safety when faults exceed a threshold and live- 
ness when too few servers are reachable. Fork-based 
systems [12, 13, 43, 44] remain safe without trusting a 
server, but they compromise liveness in two ways. First, 
if the server is unreachable, clients must block. Second, 
a faulty server can permanently partition correct clients, 
preventing them from ever observing each other’s subse- 
quent updates. 

Indeed, it is challenging to guarantee safety and live- 
ness while minimizing trust assumptions: without some 
assumptions about correct operation, providing even a 
weak guarantee like eventual consistency—the bare min- 
imum of what a storage service should provide—seems 
difficult. For example, a faulty storage node receiving an 
update from a correct client might quietly fail to prop- 
agate that update, thereby hiding it from the rest of the 
system. Perhaps surprisingly, we find that eventual con- 
sistency is possible in this environment. 

In fact, Depot meets a contract far stronger than even- 
tual consistency even under assorted and abundant faults 
and failures. This set of well-defined guarantees under 
weak assumptions is Depot’s top-level contribution, and 
it derives from a novel synthesis of prior mechanisms and 
our own. Depot is built around three key ideas:

(1) Reduce misbehavior to concurrency. As in prior
work [12, 13, 43, 44], the protocol requires that an up- 
date be signed and that it name both its antecedents and 
the system state seen by the updater. Then, misbehavior 
by clients or servers is limited to forking: showing di- 
vergent histories to different nodes. However, previous 
work detects but does not repair forks. In contrast, De- 
pot allows correct clients to join forks, that is, to incorpo- 
rate the divergence into a sensible history, which allows 
them to keep operating in the face of faults. Specifically, 
a correct node regards a fork as logically concurrent up- 
dates by two virtual nodes. At that point, correct nodes 
can handle forking by faulty nodes using the same tech- 
niques [11, 23, 37, 61, 67] that they need anyway to han- 
dle a better understood problem: logically concurrent up- 
dates during disconnected operation. 

(2) Enforce Fork-Join-Causal consistency. To allow 
end-to-end checks on SSP behavior, we must specify 
a contract: When must an update be visible to a read? 
When is it okay for a read to “miss” a recent update? De- 
pot guarantees that a correct client observes Fork-Join- 
Causal consistency (FJC) no matter how many other 
nodes are faulty. FJC is a slight weakening of causal 
consistency [7, 40, 56]. Depot defines FJC as its consis- 
tency contract because it is weak enough to enforce de- 
spite faulty nodes and without hurting availability. At the 
same time, FJC is strong enough to be useful: nodes see 
each other’s updates in an order that reflects dependen- 
cies among both correct and faulty nodes’ writes. This 
ordering is useful not only for end users of Depot but 
also internally, within Depot. 

(3) Layer other storage properties over FJC. Depot 
implements a layered architecture that builds on the or- 
dering guarantees provided by FJC to provide other de- 
sirable properties: eventual consistency, bounded stale- 
ness, durability, high availability, integrity (ensuring that 
only authorized nodes can update an object), snapshot- 
ting of versions (to guard against spurious updates from 
faulty clients), garbage collection, and eviction of faulty 
nodes.1 For all of these properties, the challenge is to
precisely define the strongest guarantee that Depot can 
provide with minimal assumptions about correct opera- 
tion. Once each property is defined, implementation is 
straightforward because we can build on FJC, which lets 
us reason about the order in which updates propagate 
through the system. 

1We are not explicitly addressing confidentiality and privacy, but,
as discussed in §3.1, existing approaches can be layered on Depot.

The price of providing these guarantees is tolerable, as
demonstrated by an experimental evaluation of a prototype
implementation of Depot. Depot adds a few hundred
bytes of metadata to each update and each stored object,
and it requires a client to sign and store each of its updates.
We demonstrate that Depot can tolerate faults and
maintain good availability, latency, overhead, and staleness
even when significant faults occur. Additionally, because
Depot makes minimal assumptions about servers,
we can implement Teapot, a variation of Depot that provides
many of Depot’s guarantees using an unmodified
SSP, such as Amazon’s S3. The difference between Depot
and Teapot suggests several modest extensions to
SSPs’ interfaces that would strengthen their guarantees.


2 Why untrusted storage? 


When we say that a component is untrusted, we are not 
adopting a “tinfoil hat” stance that the component is op- 
erated by a malicious actor, nor are we challenging the 
honesty of storage service providers. What we mean is 
that the system provides guarantees, usually achieved by 
end-to-end checks, even if the given component is in- 
correct. Since components could be incorrect for many 
reasons (as stated in the introduction), we believe that 
designing to tolerate incorrectness is prudence, not para- 
noia. We now answer some natural questions. 

SSPs are operated by large, reputable companies, so 
why not trust them? That is like asking, “Banks are large, 
reputable repositories of money, so why do we need bank 
statements?” For many reasons, customers and banks 
want customers to be able to check the bank’s view of 
their account activity. Likewise, our approach might ap- 
peal not only to customers but also to SSPs: by requiring 
less trust, a service might attract more business. 

How likely are faults in the SSP? We do not know 
the precise probability. However, we know that providers 
do fail (as mentioned in the introduction). More broadly, 
they carry non-negligible risks. First, they are opaque (by 
nature). Second, they are complex distributed systems. 
Indeed, coping with known hardware failure modes in 
local file systems is difficult [59]; in cloud storage, this 
difficulty can only grow. Given the opacity and complex- 
ity, it seems prudent not to assume the unfailing correct- 
ness of an SSP’s internals. 

Even if we do not assume that SSPs are perfect, the 
most likely failure is the occasional corrupted or lost 
block, which can be addressed with checksums and repli- 
cation. Do you really need mechanisms to handle other 
cases (that all of the nodes are faulty, that a fork happens, 
that old or out-of-order data is returned, etc.)? Replica- 
tion and checksums are helpful, and they are part of De- 
pot. However, they are not sufficient. First, failures are 
often correlated: as Vogels notes, uncorrelated failures 
are “absolutely unrealistic ...as [failures] are often trig- 
gered by external or environmental events” [69]. These 
events include the litany in the introduction. 

Second, other types of failures are possible. For exam- 
ple, a machine that loses power after failing to commit its 
output [52, 72] may lose recent updates, leading to forks 
in history. Or, a network failure might delay propagation
of an update from one SSP node to another, causing some
clients to read stale data. In general, our position is that 
rather than try to handle every possible failure individu- 
ally, it is preferable to define an end-to-end contract and 
then design a system that always meets that contract. 

The above events seem unlikely. Is tolerating them 
worth the cost? One of our purposes in this paper is to 
report for the first time what that cost is. Whether to “pur- 
chase” the guarantees is up to the application, but as the 
price is modest, we anticipate, with hope, that many ap- 
plications will find it attractive. 

What about clients? We also minimize trust of clients 
(since they are, of course, also vulnerable to faults). 


3 Architecture, scope, and use 


Figure 1 depicts Depot’s high-level architecture. A set of 
clients stores key-value pairs on a set of servers. In our 
target scenario, the servers are operated by a storage ser- 
vice provider (SSP) that is distinct from the data owner 
that operates the clients.2 Keys and values are arbitrary 
strings, with overhead engineered to be low when values 
are at least a few KB. A Depot client exposes an interface 
of GET and PUT to its application users. 

2Because Depot does not require nodes to trust each other, different
data centers in Figure 1 could be operated by different SSPs. Doing so
might reduce the risk of correlated failures across replicas [6, 38]. For
simplicity, we describe and evaluate only single-SSP configurations.

For scalability, we slice the system into groups of
servers, with each group responsible for one or more volumes.
Each volume corresponds to a range of one customer’s
keys, and a server independently runs the protocol
for each volume assigned to it. Many strategies for
partitioning keys are possible [22, 36, 51], and we leave
the assignment of keys to volumes to layers above Depot.

The servers for each volume may be geographically 
distributed, a client can access any server, and servers 
replicate updates using any topology (chain, mesh, star, 
etc.). As in Dynamo [22], to maximize availability, De- 
pot does not require overlapping read and write quorums. 
In fact, as the dotted lines suggest, Depot can even func- 
tion under complete server unavailability: the protocol 
permits clients to communicate directly with each other. 
If the SSP later recovers, clients can continue using the 
SSP (after sending the missed updates to the servers). 
This raises a question: why have the SSP at all? We point 
to the usual benefits of cloud services: cost, scalability, 
geographic replication, and management. 

We use the term node to mean either a client or a 
server. Clients and servers run the same basic Depot pro- 
tocol, though they are configured differently. 


3.1 Issues addressed 


One of our aims in this work is to push the envelope in the 
trade-offs between trust assumptions and system guaran- 
tees. Specifically, for a set of standard properties that one 
might desire in a storage system, we ask: what is the min- 
imum assumption that we need to provide useful guaran- 
tees, and what are those guarantees? The issues that we 
examine are as follows: 


• Consistency (§4–§5.2) and bounded staleness (§5.4):
Once a write occurs, the update should be visible to 
reads “soon”. Consistency and staleness properties 
limit the extent to which the storage system can re- 
order, delay, or omit making updates visible to reads. 


• Availability and durability (§5.3): Our availability
goal is to maximize the fraction of time that a client 
succeeds in reading or writing an object. Durability 
means that the system does not permanently lose data. 


• Integrity and authorization (§5.5): Only clients authorized
to update an object should be able to create valid
updates that affect reads on that object. 


• Data recovery (§5.6): Data owners care about end-to-end
reliability. Consistency, durability, and integrity
are not enough when the layers above Depot—faulty 
clients, applications, or users—can issue authorized 
writes that replace good data with bad. Depot does 
not try to distinguish good updates from bad ones, 
nor does it innovate on the abstractions used to de- 
fend data from higher-layer failures. We do however 
explore how Depot can support standard techniques 
such as snapshots to recover earlier versions of data. 


• Evicting faulty nodes (§5.7): If a faulty node provably
deviates from the protocol, we wish to evict it from the 
system so that it will not continue to disrupt operation. 
However, we must never evict correct nodes.

Depot provides the above properties with a layered
approach. Its core protocol (§4) addresses consistency. 
Specifically, the protocol enforces Fork-Join-Causal con- 
sistency (FJC), which is the same as causal consis- 
tency [7, 40, 56] in benign runs. This protocol is the 
essential building block for the other properties listed 
above. In §5, we define these properties precisely and 
discuss how Depot provides them. 

Note that we explicitly do not try to solve the confiden- 
tiality/privacy problem within Depot. Instead, like com- 
mercial storage systems [1, 4], Depot enforces integrity 
and authorization (via client signatures) but leaves it to 
higher layers to use appropriate techniques for the pri- 
vacy requirements of each application (e.g., allow global 
access, encrypt values, encrypt both keys and values, in- 
troduce artificial requests to thwart traffic analysis, etc.). 

We also do not claim that the above list of issues is ex- 
haustive. For example, it may be useful to audit storage 
service providers with black box tests to verify that they 
are storing data as promised [38, 62], but we do not ex- 
amine that issue. Still, we believe that the properties are 
sufficient to make the resulting system useful. 


3.2 Depot in use: Applications & conflicts 


Depot’s key-value store is a low-level building block over 
which many applications can be built. For example, hun- 
dreds of widely used applications—including backup, 
point of sale software, file transfer, investment analytics, 
cross-company collaboration, and telemedicine—use the 
S3 key-value store [2], and Depot can serve all of them: it 
provides a similar interface to S3, and it provides strictly 
stronger guarantees. 

An issue in systems that are causally consistent and 
weaker—a set that includes not just Depot and S3 
but also CVS, SVN, Git, Bayou [56], Coda [37], and 
others—is handling concurrent writes to the same object. 
Such conflicts are unfortunate but unavoidable: they are 
provably the price of high availability [26]. 

Many approaches to resolving conflicting updates 
have been proposed [37, 61, 67], and Depot does not 
claim to extend the state of the art on this front. In fact, 
Depot is less ambitious than some past efforts: rather 
than try to resolve conflicts internally (e.g., by picking 
a winner, merging concurrent updates, or rolling back 
and re-executing transactions [67]), Depot simply ex- 
poses concurrency when it occurs: a read of key k returns 
the set of updates to k that have not been superseded by 
any logically later update of k.3

3Note that Depot neither creates concurrency nor makes the problem
worse. If an application cannot deal with conflicts, it can still use
Depot but must restrict its use (e.g., by adding locks and sending all
operations through a single SSP node), and it must sacrifice the ability
to tolerate faults (such as forks) that appear as concurrency.

This approach is similar to that of S3’s replication
substrate, Dynamo [22], and it supports a range of
application-level policies. For example, applications using
Depot may resolve conflicts by filtering (e.g., reads
return the update by the highest-numbered node, reads
return an application-specific merge of all updates, or
reads return all updates) or by replacing (e.g., the application
reads the multiple concurrent values, performs
some computation on them, and then writes a new value
that thus appears logically after, and thereby supersedes,
the conflicting writes).
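
The two policy styles can be sketched as follows. The `Update` record and the `filter_highest_node` and `merge_and_replace` helpers are hypothetical names for illustration only; the text describes Depot's client interface just as GET and PUT, so a plain in-memory dictionary stands in for the store here.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Update:
    node_id: int  # identifier of the writing node
    value: str


# Hypothetical store: each key maps to the set of concurrent updates that
# have not been superseded, which is what a read would return.
store: Dict[str, Set[Update]] = {
    "cart": {Update(1, "apples"), Update(2, "pears")},
}


def filter_highest_node(key: str) -> Update:
    """Filtering: return the update written by the highest-numbered node."""
    return max(store[key], key=lambda u: u.node_id)


def merge_and_replace(key: str, writer: int) -> Update:
    """Replacing: merge all concurrent values and write the result back.

    The new write logically follows, and so supersedes, the conflicting ones.
    """
    merged = ",".join(sorted(u.value for u in store[key]))
    new = Update(writer, merged)
    store[key] = {new}
    return new


picked = filter_highest_node("cart")
replaced = merge_and_replace("cart", writer=1)
```

Filtering leaves the stored conflict in place and only changes what the application sees, whereas replacing removes the conflict for every later reader.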


3.3 System and threat model


We now briefly state our technical assumptions. First, 
nodes are subject to standard cryptographic hardness as- 
sumptions, and each node has a public key known to all 
nodes. Second, any number of nodes can fail in arbitrary 
(Byzantine [41]) ways: they can crash, corrupt data, lose 
data, process some updates but not others, process mes- 
sages incorrectly, collude, etc. Third, we assume that any 
pair of timely, connected, and correct nodes can even- 
tually exchange any finite number of messages. That is, 
a faulty node cannot forever prevent two correct nodes 
from communicating (but we make no assumptions about 
how long “eventually” is). 

Fourth, above we used the term correct node. This 
term refers to a node that never deviates from the pro- 
tocol nor becomes permanently unavailable. A node that 
obeys the protocol for a time but later deviates is not 
counted as correct. Conversely, a node that crashes and 
recovers with committed state intact is equivalent to a 
correct node that is slow. Fifth, to ensure the liveness of 
garbage collection, we assume that unresponsive clients 
are eventually repaired or replaced. To satisfy this as- 
sumption, an administrator can install an unresponsive 
client’s keys and configuration on new hardware [15]. 


4 Core protocol 


In Depot, clients’ reads and updates to shared objects 
should always appear in an order that reflects the logic of 
higher layers. For example, an update that removes one’s 
parents from a friend list and an update that posts spring 
break photos should appear in that order, not the other 
way around [21]. However, Depot has two challenges. 
First, it aims for maximum availability, which fundamen- 
tally conflicts with the strictest orderings [26]. Second, it 
aims to provide its ordering guarantees despite arbitrary 
misbehavior from any subset of nodes. In this section, 
we describe how the protocol at Depot’s core achieves a 
sensible and robust order of updates while optimizing for 
availability and tolerating arbitrary misbehavior. 

As mentioned above, this basic protocol is run by both 
clients and servers. This symmetry not only simplifies 
the design but also provides flexibility. For example, if
servers are unreachable, clients can share data directly.
For simplicity, the description below does not distinguish 
between clients and servers. 


4.1 Basic protocol 


This subsection describes the basic protocol to propagate 
updates, ignoring the problems raised by faulty nodes. 
The protocol is essentially a standard log exchange pro- 
tocol [10, 56]; we describe it here for background and to 
define terms. 

The core message in Depot is an update that changes 
the value associated with a key. It has the following form: 
$dVV,\ \{key,\ H(value),\ logicalClock@nodeID,\ H(history)\}_{\sigma_{nodeID}}$


Updates are associated with logical times. A node as- 
signs each update an accept stamp of the form logical- 
Clock@nodeID [56]. A node N increments its logical 
clock on each local write. Also, when N receives an up- 
date u from another node, N advances its logical clock to 
exceed u’s. Thus, an update’s accept stamp exceeds the 
accept stamp of any update on which it depends [40]. The 
remaining fields, dVV and H(history), and the writer’s 
signature, $\sigma_{nodeID}$, defend against faults and are discussed
in subsections 4.2 and 4.3. 
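
A minimal sketch of the accept-stamp rules just described, assuming integer logical clocks. The `Node` class and its method names are invented for this example and are not taken from the Depot implementation.

```python
from typing import Tuple

AcceptStamp = Tuple[int, str]  # (logicalClock, nodeID)


class Node:
    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.clock = 0

    def local_write(self) -> AcceptStamp:
        # Increment the logical clock on each local write.
        self.clock += 1
        return (self.clock, self.node_id)

    def receive(self, stamp: AcceptStamp) -> None:
        # Advance the local clock so that it exceeds the received update's.
        self.clock = max(self.clock, stamp[0] + 1)


a, b = Node("A"), Node("B")
s1 = a.local_write()
s2 = a.local_write()
b.receive(s2)
s3 = b.local_write()  # issued after B has seen s2, so it depends on s2
```

Comparing stamps as tuples orders them by logical clock first and node identifier second, so `s3` compares greater than `s2`, the update it depends on.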

Each node maintains two local data structures: a log 
of updates it has seen and a checkpoint reflecting the cur- 
rent state of the system. For efficiency, Depot separates 
data from metadata [10], so the log and checkpoint con- 
tain collision-resistant hashes of values. If a node knows 
the hash of a value, it can fetch the full value from an- 
other node and store the full value in its checkpoint. Each 
node sorts the updates in its log by accept stamp, sorting
first by logicalClock and breaking ties with nodeID.
Thus, each new write issued by a node appears at the end 
of its own log and (assuming no faulty nodes) the log 
reflects a causally consistent ordering of all writes. 

Information about updates propagates through the sys- 
tem when nodes exchange tails of their logs. Each node 
N maintains a version vector VV with an entry for each 
node M in the system: N.VV[M] is the highest logical 
clock N has observed for any update by M [55]. To trans- 
mit updates from node M to node N, M sends to N the 
updates from its log that N has not seen. 
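
The log exchange can be sketched as below, ignoring faults, signatures, and values. The `Replica` class and its `apply` and `missing_for` methods are hypothetical names, and an update is reduced to a `(logicalClock, nodeID, key)` tuple for brevity.

```python
from typing import Dict, List, Tuple

Update = Tuple[int, str, str]  # (logicalClock, nodeID, key); payload omitted


class Replica:
    def __init__(self) -> None:
        self.log: List[Update] = []
        self.vv: Dict[str, int] = {}  # nodeID -> highest clock seen from it

    def apply(self, u: Update) -> None:
        clock, node_id, _ = u
        self.log.append(u)
        # Keep the log sorted by accept stamp: clock first, then nodeID.
        self.log.sort(key=lambda x: (x[0], x[1]))
        self.vv[node_id] = max(self.vv.get(node_id, 0), clock)

    def missing_for(self, peer_vv: Dict[str, int]) -> List[Update]:
        # Updates whose clock exceeds what the peer has seen from that writer.
        return [u for u in self.log if u[0] > peer_vv.get(u[1], 0)]


m, n = Replica(), Replica()
for u in [(1, "A", "x"), (2, "A", "y"), (2, "B", "x")]:
    m.apply(u)
n.apply((1, "A", "x"))

# M sends N only the updates that N's version vector shows it has not seen.
tail = m.missing_for(n.vv)
for u in tail:
    n.apply(u)
```

After the exchange, both replicas hold the same log and the same version vector.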

Two updates are logically concurrent if neither ap- 
pears in the other’s history. Concurrent writes may con- 
flict if they update the same object; conflicts are handled 
as described in Section 3.2. 


4.2 Consistency despite faults 


There are three fields in an update that defend the pro- 
tocol against faulty nodes. The first is a history hash, 
H(history), that encodes the history on which the update
depends using a collision-resistant hash that covers the 
most recent update by each node known to the writer
when it issued the update. By recursion, this hash covers
all updates included by the writer’s current version
vector. Second, each update is sent with a dependency 
version vector, dVV, that indicates the version vector that 
the history hash covers. Note that while dVV logically 
represents a full version vector, when node N creates an 
update u, u’s dVV actually contains only the entries that 
have changed since the last write by N. Third, a node 
signs its updates with its private key. 

A correct node C accepts an update u only if it meets 
five conditions. First, u must be properly signed. Sec- 
ond, except as described in the next subsection, u must 
be newer than any updates from the signing node that 
C has already received. This check prevents C from ac- 
cepting updates that modify the history of another node’s 
writes. Third, C’s version vector must include u’s dVV. 
Fourth, u’s history hash must match a hash computed 
by C across every node’s last update at time dVV. The 
third and fourth checks ensure that before receiving up- 
date u, C has received all of the updates on which u de- 
pends. Fifth, u’s accept stamp must be at most a constant 
times C’s current wall-clock time (e.g., u.acceptStamp < 
1000 * currentTimeMillis()). This check defends against 
exhaustion of the 64-bit logical time space. 
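
The five checks can be sketched as follows, under several simplifying assumptions: signature verification is reduced to a boolean flag, each update carries a full dependency version vector rather than only the changed entries, the history hash is computed with SHA-256 over the digests of each node's last update, and the fork handling of the next subsection is omitted. All class and function names are hypothetical.

```python
import hashlib
import time
from dataclasses import dataclass
from typing import Dict


@dataclass
class Update:
    writer: str
    clock: int
    dvv: Dict[str, int]    # full dependency version vector (assumption)
    history_hash: str
    signature_valid: bool  # stand-in for a real signature check

    def digest(self) -> str:
        return hashlib.sha256(f"{self.writer}:{self.clock}".encode()).hexdigest()


def history_hash(digests: Dict[str, Dict[int, str]], dvv: Dict[str, int]) -> str:
    """Hash over every node's last update as of version vector dvv."""
    h = hashlib.sha256()
    for node in sorted(dvv):
        # An unknown update contributes a placeholder, so the hashes differ.
        h.update(digests.get(node, {}).get(dvv[node], "?").encode())
    return h.hexdigest()


class Receiver:
    def __init__(self) -> None:
        self.vv: Dict[str, int] = {}
        self.digests: Dict[str, Dict[int, str]] = {}

    def accepts(self, u: Update) -> bool:
        if not u.signature_valid:                                  # check 1
            return False
        if u.clock <= self.vv.get(u.writer, 0):                    # check 2
            return False
        if any(self.vv.get(n, 0) < c for n, c in u.dvv.items()):   # check 3
            return False
        if history_hash(self.digests, u.dvv) != u.history_hash:    # check 4
            return False
        if u.clock >= 1000 * int(time.time() * 1000):              # check 5
            return False
        return True

    def apply(self, u: Update) -> None:
        self.vv[u.writer] = u.clock
        self.digests.setdefault(u.writer, {})[u.clock] = u.digest()


c = Receiver()
u1 = Update("A", 1, {}, history_hash({}, {}), True)
ok_first = c.accepts(u1)
c.apply(u1)
u2 = Update("B", 2, {"A": 1}, history_hash(c.digests, {"A": 1}), True)
ok_second = c.accepts(u2)
ok_forged = c.accepts(Update("B", 2, {"A": 1}, "bogus", True))
ok_replay = c.accepts(u1)  # not newer than what C already has from A
```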

Given these checks, attempts by a faulty node to fabri- 
cate u and pass it as coming from a correct node, to omit 
updates on which u depends, or to reorder updates on 
which u depends will result in C rejecting u. To compro- 
mise causal consistency, a faulty node has one remaining 
option: to fork, that is, to show different sequences of 
updates to different communication partners [43]. Such 
behavior certainly damages consistency. However, the 
mechanisms above limit that damage, as we now illus- 
trate with an example. Then, in subsection 4.3 we de- 
scribe how Depot recovers from forks. 


Example: The history hash in action A faulty node M
can create two updates u1@M and u1′@M such that neither
update’s history includes the other’s. M can then send
u1@M and the updates on which it depends to one node,
N1, and u1′@M and its preceding updates to another node,
N2. N1 can then issue new updates that depend on updates
from one of M’s forked updates (here, u1@M) and
send these new updates to N2. At this point, absent the
history hash, N2 would receive N1’s new updates without
receiving the updates by M on which they depend: N2
already received u1′@M, so its version vector appears to
already include the prior updates. Then, if N2 applies just
N1’s writes to its log and checkpoint, multiple consistency
violations could occur. First, the system may never
achieve eventual consistency because N2 may never see
write u1@M. Further, the system may violate causality
because N2 has updates from N1 but not some earlier
updates (e.g., u1@M) on which they depend.

The above confusion is prevented by the history hash.
If N1 tries to send its new updates to N2, N2 will be
unable to match the new updates’ history hashes to the
updates N2 actually observed, and N2 will reject N1’s
updates (and vice-versa). As a result, N1 and N2 will be
unable to exchange any updates after the fork junction
introduced by M after u0@M.
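
A small sketch of why the version vector alone misses the fork while the history hash catches it. It assumes that the two forked updates by M carry the same logical clock and that the history hash is a SHA-256 digest over each node's last update; the helper names are hypothetical.

```python
import hashlib
from typing import Dict


def update_digest(writer: str, clock: int, value: str) -> str:
    return hashlib.sha256(f"{writer}|{clock}|{value}".encode()).hexdigest()


def history_hash(last_updates: Dict[str, str]) -> str:
    """Hash over the digest of each node's most recent update."""
    h = hashlib.sha256()
    for node in sorted(last_updates):
        h.update(node.encode())
        h.update(last_updates[node].encode())
    return h.hexdigest()


# M forks: two different updates carry the same accept stamp, 1@M.
fork_a = update_digest("M", 1, "value shown to N1")
fork_b = update_digest("M", 1, "value shown to N2")

# N1's new update names the history on which it depends.
n1_claimed = history_hash({"M": fork_a})
# N2 recomputes the hash from what it actually holds for version vector {M: 1}.
n2_expected = history_hash({"M": fork_b})

# Both nodes record M's clock as 1, so a version-vector check alone passes.
version_vectors_match = True
n2_accepts = version_vectors_match and (n1_claimed == n2_expected)
```

Here `n2_accepts` is false: the version vectors agree, but the hashes do not, so N2 rejects N1's update.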


Discussion At this point, we have composed mecha- 
nisms from Bayou [56] and PRACTI [10] (update ex- 
change), SUNDR [43] (signed version vectors), and 
BFT2F [44] (history hashes, here used by clients and 
modified to apply to history trees instead of linear his- 
tories) to provide fork-causal consistency (FCC) under 
arbitrary faults. We define FCC precisely in a technical 
report [45]. Informally, it means that each node sees a 
causally consistent subset of the system’s updates even 
though the whole system may no longer be causally con- 
sistent. Thus, although the global history has branched, 
as each node peers backward from its branch to the be- 
ginning of time, it sees causal events the entire way. 
Unfortunately, enforcing even this weakening of 
causal consistency would prohibit eventual consistency, 
crippling the system: FCC requires that once two nodes 
have been forked, they can never observe one another’s 
updates after the fork junction [43]. In many environ- 
ments, partitioning nodes this way is unacceptable. In 
those cases, it would be far preferable to further weaken 
consistency to ensure an availability property: connected, 
correct nodes can always share updates. We now de- 
scribe how Depot achieves this property, using a new 
mechanism: joining forks in the system’s history. 


4.3 Protecting availability: Joining forks 


To join forks, nodes use a simple coping strategy: they 
convert concurrent updates by a single faulty node into 
concurrent updates by a pair of virtual nodes. A node that 
receives these updates handles them as it would “normal” 
concurrency: it applies both sets of updates to its state 
and, if both branches modify the same object, it returns 
both conflicting updates on reads (§3.2). We now fill in 
some details. 


Identifying a fork First consider a two-way fork. A fork
junction comprises exactly three updates where a faulty
node M has created two updates (e.g., u_{1@M} and u'_{1@M})
such that (i) neither update includes the other in its history
and (ii) each update’s history hash links it to the
same previous update by that writer (e.g., u_{0@M}). If a
node N2 receives from a node N1 an update whose history
is incompatible with the updates it has already received,
and if neither node has yet identified the fork
junction, N1 and N2 identify the three forking updates as
follows. First, N1 and N2 perform a binary search on the
updates included in the nodes’ version vectors to identify
the latest version vector, VV, encompassing a common
history. Then, N1 sends its log of updates beginning from
VV. Finally, at some point, N2 receives the first update
by M (e.g., u_{1@M}) that is incompatible with the updates
by M that N2 has already received (e.g., u_{0@M} and u'_{1@M}).


Tracking forked histories After a node identifies the 
three updates in the fork junction, it expands its version 
vector to include three entries for the node that issued the 
forking updates. The first is the pre-fork entry, whose in- 
dex is the index (e.g., M) before the fork and whose con- 
tents will not advance past the logical clock of the last 
update before the fork (e.g., u_{0@M}). The other two are
the post-fork entries, whose indices consist of the index 
before the fork augmented with the history hash of the re- 
spective first update after the fork. Each of these entries 
initially holds the logical clock of the first update after 
the fork (e.g., of u_{1@M} and u'_{1@M}); these values advance
as the node receives new updates after the fork junction. 

Note that this approach works without modification 
if a faulty node creates a j-way fork, creating updates 
u^1_{1@M}, u^2_{1@M}, ..., u^j_{1@M} that link to the same prior update
(e.g., u_{0@M}). The reason is that, regardless of the
order in which nodes detect fork junctions, the branches 
receive identical names (because branches are named by 
the first update in the branch). A faulty node that is re- 
sponsible for multiple dependent forks does not stymie 
this construction either. After i dependent forks, a virtual 
node’s index in the version vector is well-defined: it is 
M || H(u_{fork_1}) || H(u_{fork_2}) || ... || H(u_{fork_i}) [56].
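
As an illustration of the bookkeeping just described, the sketch below splits one version vector entry into a pre-fork entry and one post-fork entry per branch. It assumes a version vector is a plain dictionary from index to logical clock and that SHA-256 over a serialized update stands in for the history hash H; the function names `branch_index` and `split_entry` are hypothetical.

```python
import hashlib


def branch_index(index: str, first_update: bytes) -> str:
    # Virtual node index: old index || H(first update on the branch).
    return index + "||" + hashlib.sha256(first_update).hexdigest()


def split_entry(vv: dict, index: str, last_prefork_clock: int, branches: list) -> dict:
    """Return a copy of version vector `vv` with `index` split at a fork.

    `branches` holds (first_update_bytes, clock) for each forking update.
    """
    expanded = dict(vv)
    expanded[index] = last_prefork_clock          # pre-fork entry stops advancing
    for first_update, clock in branches:
        # Post-fork entries start at the clock of the branch's first update.
        expanded[branch_index(index, first_update)] = clock
    return expanded
```

Because a branch is named only by the hash of its first update, two nodes that detect the same fork in different orders compute the same indices, and applying `branch_index` to an already-forked index yields the chained form used for dependent forks.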


Log exchange revisited The expanded version vector 
allows a node to identify which updates to send to a peer. 
In the standard protocol, when a node N2 wants to re- 
ceive updates from N1, it sends its current version vector 
to N1 to identify which updates it needs. After N2 de- 
tects a fork and splits one version vector entry into three, 
it simply includes all three entries when asking N1 for 
updates. Note that N1 may not be aware of the fork, but
the history hashes that are part of the indices of N2’s expanded
version vector (as per the virtual node construction
above) tell N2 to which branch N1’s updates should
be applied and tell N1 which updates to actually send.
Conversely, if the sender N1 has received updates that 
belong to neither branch, then N1 and N2 identify the 
new fork junction as described above. 


Bounding forks The overhead of this coping strategy 
is the space, bandwidth, and computation needed for 
fork detection and larger version vectors. Depot bounds 
the number of forks that faulty nodes can introduce 
by (1) making nodes “vouch” for updates by a forking 
node that they had received before learning of the fork 
and (2) making them promise not to communicate with 
known forking nodes. We omit the details for space.


Dimension        Safety/Liveness   Property                          Correct nodes required
Consistency      Safety            Fork-Join Causal                  Any subset
                 Safety            Bounded staleness                 Any subset
                 Safety            Eventual consistency (s)          Any subset
Availability     Liveness          Eventual consistency (l)          Any subset
                 Liveness          Always write                      Any subset
                 Liveness          Always exchange                   Any subset
                 Liveness          Write propagation                 Any subset
                 Liveness          Read availability / durability    A correct node has object
Integrity        Safety            Only auth. updates                Clients
Recoverability   Safety            Valid discard                     1 correct client
Eviction         Safety            Valid eviction                    Any subset

FIG. 2: Summary of properties provided by Depot.


5 Properties and guarantees


This section describes how Depot enforces needed prop- 
erties with minimal trust assumptions. Figure 2 summa- 
rizes these properties and lists the required assumptions. 
Below, we define these properties and describe how De- 
pot provides them. The key idea is that the replication 
protocol enforces Fork-Join-Causal consistency (FJC). 
Given FJC, we can constrain and reason about the order 
that updates propagate and use those constraints to help 
enforce the remaining properties. 


5.1 Fork-Join-Causal consistency 


Clients expect a storage service to provide consistent ac- 
cess to stored data. Depot guarantees a new consistency 
semantic for all reads and updates to a volume that are 
observed by any correct node: Fork-Join-Causal consis- 
tency (FJC). A formal description of FJC appears in our 
technical report [45]. Here we describe its core property: 


• Dependency preservation. If update u_1 by a correct
node depends on an update u_0 by any node, then u_0 becomes
observable before u_1 at any correct node. (An
update u of an object o is observable at a node if a read
of o would return a version at least as new as u [25].)
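
Dependency preservation can be read as a check on the order in which one correct node observes updates. The sketch below is an illustration only: it assumes updates are identified by strings, that `depends_on` lists dependencies only for updates issued by correct nodes, and that the hypothetical function `preserves_dependencies` is handed the node's full observation order.

```python
def preserves_dependencies(observed_order, depends_on):
    """Check one node's observation order against declared dependencies.

    observed_order: update ids in the order they became observable.
    depends_on: maps an update id (issued by a correct node) to the ids
                of the updates it depends on.
    """
    position = {u: i for i, u in enumerate(observed_order)}
    for u, deps in depends_on.items():
        if u not in position:
            continue                      # u not observed yet: nothing to check
        for d in deps:
            # Every dependency must already be observable when u is.
            if d not in position or position[d] >= position[u]:
                return False
    return True
```

For instance, the order `["u0", "u1"]` passes when `u1` depends on `u0`, while `["u1", "u0"]` or `["u1"]` alone does not.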


To explain FJC, we contrast it with causal consistency 
(CC) in fail-stop systems [7, 40, 56]. CC is based on a 
dependency preservation property that is identical to the 
one above, except that it omits the “correct nodes” quali- 
fication. Thus, to applications and users, FJC appears al- 
most identical to causal consistency with two exceptions. 
First, under FJC, a faulty node can issue forking writes w 
and w’ such that one correct node observes w without 
first observing w’ while another observes w’ without first 
observing w. Second, under FJC, faulty nodes can issue 
updates whose stated histories do not include all updates 
on which they actually depend. For example, when creating
the forking updates w and w’ just described, the
faulty node might have first read updates u_{C1} and u_{C2}
from nodes C1 and C2, then created w that claimed to
depend on u_{C1} but not u_{C2}, and finally created update w’
that claimed to depend on u_{C2} but not u_{C1}. Note, however,
that once a correct node observes w (or w’), it will include
w (or w’) in its subsequent writes’ histories. Thus, as correct
nodes observe each other’s writes, they will also observe
both w and w’ and their respective dependencies in
a consistent way. Specifically, w and w’ will appear as
causally concurrent writes by two virtual nodes (§4.3).
Though FJC is weaker than linearizability, sequential 
consistency, or causal consistency, it still provides prop- 
erties that are critical to programmers. First, FJC implies 
a number of useful session guarantees [66] for programs 
at correct nodes, including monotonic reads, monotonic 
writes, read-your-writes, and writes-follow-reads. Sec- 
ond, as we describe in the subsections below, FJC is the 
foundation for eventual consistency, for bounded stale- 
ness, and for further properties beyond consistency. 
Stronger consistency during benign runs. Depot 
guarantees FJC even if an arbitrary number of nodes 
fail in arbitrary ways. However, it provides a stronger 
guarantee—causal consistency—during runs with only 
omission failures. Of course, causal consistency itself is 
weaker than sequential consistency or linearizability. We 
accept this weakening because it allows Depot to remain 
available to reads and writes during partitions [22, 26]. 


5.2 Eventual consistency 


The term eventual consistency is often used informally, 
and, as the name suggests, it is usually associated with 
both liveness (“eventual”) and safety (“consistency”). 
For precision, we define eventual consistency as follows. 


• Eventual consistency (safety). Successful reads of an
object at correct nodes that observe the same set of
updates return the same values.


• Eventual consistency (liveness). Any update issued or
observed by a correct node is eventually observable by
all correct nodes.


The safety property is directly implied by FJC. The 
liveness property is ensured by the replication proto- 
col (§4), which entangles updates to prevent selective 
transmission, and by the communication heuristics (§6), 
which allow a node that is unable to communicate with a 
server to communicate with any other server or client. 


5.3 Availability and durability 


In this subsection, we consider availability of reads, of 
writes, and of update propagation. We also consider 
durability. We begin by noting that the following strong 
availability properties follow from the protocol in §4 and 
the communication heuristics (§6): 


• Always write. An authorized node can always update
any object.


• Always exchange. Any subset of correct nodes can exchange
any updates that they have observed, assuming
they can communicate as per our model in §3.3.


• Write propagation. If a correct node issues a write,
eventually all correct nodes observe that write, assuming
that any message sent between correct nodes is
eventually delivered.


Unfortunately, there is a limit to what any storage sys- 
tem can guarantee for reads: if no correct node has an ob- 
ject, then the object may not be durable, and if no correct, 
reachable node has an object, then the object may not be 
available. Nevertheless, we could, at least in principle, 
still have each node rely only on itself for read avail- 
ability and durability: nodes could propagate updates and 
values, and all servers and all clients could store all val- 
ues. However, fully replicating all data is not appealing 
for many cloud storage applications. 

Depot copes with these limits in two ways. First, De- 
pot provides guarantees on read availability and durabil- 
ity that minimize the required number of correct nodes. 
Second, Depot makes it likely that this number of correct
nodes actually exists. The guarantees are as follows
(note that durability, roughly “the system does not
permanently lose my data”, manifests as a liveness property):

• Read availability. If during a sufficiently long synchronous
interval any reachable and correct node has
an object’s value, then a read by a correct node will
succeed.


• Durability. If any correct hoarding node, as defined
below, has an object’s value, then a read of that object
will eventually succeed. That is, an update is durable
once its value reaches a correct node that will not
prematurely discard it.


A hoarding node is a node that stores the value of a ver- 
sion of an object until that version is garbage collected 
(§5.6). In contrast, a caching node may discard a value at
any time. 

To make it likely that the premise of the guarantees 
holds—namely that a correct node has the data—Depot 
does three things. First, its configuration replicates data 
to survive important failure scenarios. All servers usually 
store values for all updates they receive: except as dis- 
cussed in the remainder of this subsection, when a client 
sends an update to a server and when servers transmit 
updates to other servers, the associated value is included 
with the update. Additionally, the client that issues an 
update also stores the associated value, so even if all 
servers become unavailable, clients can fetch the value 
from the original writer. Such replication allows the sys- 
tem to handle not only the routine failure case where a 
subset of servers and clients fail and lose data but also the 
client disaster and cloud disaster cases where all clients
or all servers fail [5, 14] or become unavailable [8].

Second, receipts allow a node to avoid accepting an 
insufficiently-replicated update. When a server processes 
an update and stores the update’s value, it signs a receipt 
and sends the receipt to the other servers. Then, we ex- 
tend the basic protocol to require that an update carry 
either (a) a receipt set indicating that at least k servers 
have stored the value or (b) the value, itself. 
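
The acceptance rule in the preceding paragraph can be sketched as follows. This is a simplified illustration: it assumes receipts have already been signature-checked and are represented only by the identifiers of the servers that issued them, and the names `IncomingUpdate` and `sufficiently_replicated` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IncomingUpdate:
    update_id: str
    value: Optional[bytes] = None                 # present if the value travels with the update
    receipts: set = field(default_factory=set)    # ids of servers with verified receipts


def sufficiently_replicated(msg: IncomingUpdate, k: int, known_servers: set) -> bool:
    # Case (b): the value itself accompanies the update.
    if msg.value is not None:
        return True
    # Case (a): at least k distinct known servers attest that they store the value.
    return len(msg.receipts & known_servers) >= k
```

Counting only receipts from `known_servers` reflects the assumption that receipts from unknown or non-server nodes should not count toward k.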

Thus, in normal operation, servers receive and store 
updates with values, and clients receive and store updates 
with receipt sets. However, if over some interval, fewer 
than k servers are available, clients will instead receive, 
store, and propagate both updates and values for updates 
created during this interval. Finally, although servers nor- 
mally receive updates and values together, there are cor- 
ner cases where—to avoid violating the always exchange 
property—they must accept an update with only a receipt 
set. Thus, in the worst case Depot can guarantee only 
that an object value not stored locally is replicated by the 
client that created it and by at least k servers. 

Third, if a client has an outstanding read for version 
v, it withholds assent to garbage collect v (§5.6) until the
read completes with either v or a newer version. 


5.4 Bounded staleness 


A client expects that soon after it updates an object, other
clients that read the object see the update. The following
guarantee codifies this expectation:

• Bounded staleness. If correct clients C1 and C2 have
clocks that remain within Δ of a true clock and C1
updates an object at time t_0, then by no later than t_0 +
2T_ann + T_prop + Δ, either (1) the update is observable
to C2 or (2) C2 suspects that it has missed an update
from C1.

T_ann and T_prop are configuration parameters indicating
how often a node announces its liveness and how long
propagating such announcements is expected to take;
both are typically a few tens of seconds.

Depot uses FJC consistency to guarantee that a client
always either knows it has seen all recent updates or suspects
it has not. Every T_ann seconds, each client updates
a per-client beacon object [43] in each volume with its
current physical time. When C2 sees that C1’s beacon
object indicates time t, then C2 is guaranteed, by FJC
consistency, to see all updates issued by C1 before time
t. On the other hand, if C1’s beacon object does not show
a recent time, C2 suspects that it may not have seen other
recent updates by C1.
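
A small sketch of the beacon check follows. The bound itself comes from the guarantee above; the suspicion rule is an assumption, since the text says only that the beacon must show "a recent time": here a beacon counts as recent if it is no older than 2T_ann + T_prop + Δ. The function names are hypothetical and all times are in seconds.

```python
def staleness_bound(t0: float, t_ann: float, t_prop: float, delta: float) -> float:
    # Latest time by which C2 either observes the update made at t0
    # or suspects that it has missed an update.
    return t0 + 2 * t_ann + t_prop + delta


def suspects_missed_updates(now: float, beacon_time: float,
                            t_ann: float, t_prop: float, delta: float) -> bool:
    # Assumed rule: suspect the writer once its latest visible beacon
    # is older than the slack allowed by the staleness bound.
    return now - beacon_time > 2 * t_ann + t_prop + delta
```

With T_ann = 30, T_prop = 20, and Δ = 1, an update made at time 100 is covered by time 181, and a beacon that is 100 seconds old triggers suspicion while one that is 50 seconds old does not.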

When C2 suspects that it has missed updates from C1, 
it switches to receiving updates from a different server. 
If that does not resolve the problem, C2 tries to contact 
C1 directly to fetch any missed updates and the updates 
on which those missed updates depend.

Applications use the above mechanism as follows. If
a node suspects missing updates, then an application that
calls GET has two options. First, GET can return a warn- 
ing that the result might be stale. This option is our de- 
fault; it provides the bounded staleness guarantee above. 
Alternatively, an application that prefers to trade worse 
availability for better consistency [26] can retry with dif- 
ferent servers and clients, blocking until the local client 
has received all recent beacons. 

Note that a faulty client might fail to update its beacon, 
making all clients suspicious all the time. What, then, are 
the benefits of this bounded staleness guarantee? First, 
although Depot is prepared for the worst failures, we ex- 
pect that it often operates in benign conditions. When 
clients, servers, and the network operate properly, clients 
are given an explicit guarantee that they are reading fresh 
data. Second, when some servers or network paths are 
faulty, suspicion causes clients to fail-over to other com- 
munication paths to get recent updates. 

Bounded staleness v. FJC. Bounded staleness and 
FJC consistency are complementary properties in Depot. 
Without bounded staleness, a faulty server could serve a 
client an arbitrarily old snapshot of the system’s state— 
and be correct according to FJC. Conversely, bounding 
staleness without a consistency guarantee (assuming that 
is even possible; we bound staleness by relying on con- 
sistency) is not enough. For engineering reasons, our 
staleness guarantees are tens of seconds; absent consis- 
tency guarantees, applications would get confused be- 
cause there could be significant periods of time when 
some updates are visible, but related ones are not. 


5.5 Integrity and authorization 


Under Depot, no matter how many nodes are faulty, only 
authorized clients can update a key/value pair in a way 
that affects correct clients’ reads: the protocol requires 
nodes to sign their updates, and correct nodes reject 
unauthorized updates. 

A natural question is: how does the system know 
which nodes are authorized to update which objects? Our 
prototype takes a simple approach. When a volume is 
created, it is statically configured to associate ranges of 
lookup keys with specific nodes’ public keys. This al- 
lows specific clients to write specific subsets of the sys- 
tem’s objects, and it prevents servers from modifying 
the objects that they store on behalf of clients. Imple- 
menting more sophisticated approaches to key manage- 
ment [48, 71] is future work. We speculate that Depot’s 
FJC consistency will make it relatively easy to ensure 
a sensible ordering of policy updates and access control 
decisions [24, 71]. 
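
The static configuration described above amounts to a range lookup. The sketch below assumes half-open, non-overlapping ranges of string lookup keys and assumes the update's signature has already been verified against `signer_public_key`; the representation of the access list and the name `authorized` are hypothetical.

```python
def authorized(acl, lookup_key: str, signer_public_key: str) -> bool:
    """acl: list of (range_start, range_end, public_key) with start <= key < end."""
    for start, end, public_key in acl:
        if start <= lookup_key < end:
            # Only the node whose public key owns this range may update the key.
            return public_key == signer_public_key
    return False   # keys outside every configured range are not writable
```

Because servers' public keys appear in no range, a server that stores an object on behalf of a client cannot produce an update that correct nodes would accept for it.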


5.6 Data recovery 


Even if a storage system retains a consistent and fresh 
view of the data written to it, data owners care about end- 
to-end reliability, and the applications and users above 
the storage system pose a significant risk. For example, 
many of the failures listed in the introduction may cor- 
rupt or destroy valuable data. Depot does not attempt to 
distinguish “good” updates from “bad” ones or advance 
the state of the art in protecting storage systems from bad 
updates. Depot’s FJC consistency does, however, provide 
a basis for applying many standard defenses. For exam- 
ple, Depot can keep all versions of the objects in a vol- 
ume, or it can provide a basic laddered backup scheme 
(all versions of an object kept for a day, daily versions 
kept for a week, weekly versions kept for a month, and 
monthly versions kept for a year). 
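
One possible reading of the laddered scheme is sketched below. It assumes that a "daily", "weekly", or "monthly" version means the newest version in each calendar day, ISO week, or calendar month, and that a month and a year are 30 and 365 days; none of these details are specified in the text, and `ladder_keep` is a hypothetical name.

```python
from datetime import datetime, timedelta


def ladder_keep(timestamps, now):
    """Return the set of version timestamps retained by the backup ladder."""
    keep = set()
    newest_per_bucket = {}
    for ts in sorted(timestamps):
        age = now - ts
        if age <= timedelta(days=1):
            keep.add(ts)                                   # all versions for a day
            continue
        if age <= timedelta(days=7):
            bucket = ("day", ts.date())                    # daily versions for a week
        elif age <= timedelta(days=30):
            bucket = ("week", tuple(ts.isocalendar())[:2]) # weekly versions for a month
        elif age <= timedelta(days=365):
            bucket = ("month", (ts.year, ts.month))        # monthly versions for a year
        else:
            continue                                       # older than the ladder
        newest_per_bucket[bucket] = ts                     # ascending order: newest wins
    return keep | set(newest_per_bucket.values())
```

Versions that fall outside the returned set are the candidates that a discard list could name.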

Given FJC consistency, implementing laddered back- 
ups is straightforward. Initially, servers retain every up- 
date and value that they receive, and clients retain the 
update and value for every update that they create. 
Then, servers and clients discard the non-laddered ver- 
sions by unanimous consent of clients. Every day, clients 
garbage collect a prefix of the system’s logs by produc- 
ing a checkpoint of the system’s state (using techniques 
adopted from Bayou [56]). The checkpoint includes in- 
formation needed to protect the system’s consistency and 
a candidate discard list (CDL) that states which prior 
checkpoints and which versions of which objects may be 
discarded. The job of proposing the checkpoint rotates 
over the clients each day. 

The keys to correctness here are (a) a correct client 
will not sign a CDL that would delete a checkpoint pre- 
maturely and (b) a correct node discards a checkpoint or 
version if and only if it is listed in a CDL signed by all 
clients. These checks ensure the following property: 


• Valid discard. If at least one client is correct, a correct
node will never discard a checkpoint or a version of an
object required by the backup ladder.
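
The two checks behind this property can be sketched as follows, assuming that signatures on a candidate discard list (CDL) have already been verified and are represented by the set of signing client identifiers. The function names are hypothetical.

```python
def willing_to_sign(cdl_items, required_by_ladder) -> bool:
    # Check (a): a correct client refuses a CDL that names anything
    # the backup ladder still requires.
    return not (set(cdl_items) & set(required_by_ladder))


def may_discard(item, cdl_items, cdl_signers, all_clients) -> bool:
    # Check (b): discard only items listed in a CDL signed by every client.
    return item in cdl_items and set(all_clients) <= set(cdl_signers)
```

If even one correct client withholds its signature under the first check, the second check fails at every correct node, which is why a single correct client suffices for valid discard.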


Note that a faulty client cannot cause the system to 
discard data that it needs: the above approach provides 
the same read availability and durability guarantees for 
backup versions as for the current version (§5.3). However,
a faulty client can delay garbage collection. If a
checkpoint fails to garner unanimous consent, clients no- 
tify an administrator, who troubleshoots the faulty client 
or, if all else fails, replaces it with a new machine. Thus, 
faulty clients can cause the system to consume extra 
storage—but only temporarily, assuming that unrespon- 
sive clients are eventually repaired or replaced (§3.3). 


5.7 Evicting faulty nodes 


Depot evicts nodes that provably deviate from the protocol
(e.g., by issuing forking writes) and ensures:


• Valid eviction. No correct node is ever evicted.


For space, we discuss eviction only at a high level; 
details are in our technical report [45]. We use proofs of 
misbehavior (POMs): because nodes’ updates are signed, 
many misbehaviors are provable as such. For example, 
when a node N observes forking writes from a faulty 
node M, it creates a POM and slots the POM into the 
update log, ensuring that the POM will propagate. Note 
that eviction occurs only if there is a true proof of mis- 
behavior. If a faulty node is merely unresponsive, that is 
handled exactly as SLA violations are today. 


6 Implementation 


Our Depot prototype is implemented in Java. It keeps every
version written, so it does not implement laddered backups
or garbage collection (§5.6). It is otherwise complete
(but not optimized). The prototype uses Berkeley
DB (BDB) for local storage and does so synchronously:
after writing to BDB, Depot calls commit before returning
to the caller, and we configure BDB to call fsync on
every commit.⁴

Implementation of GET & PUT. Depot clients expose 
a PUT and GET API and implement these calls over the 
log exchange protocol (§4). Recall that Depot separates 
data from metadata and that an update is only the meta- 
data. Each client node chooses a (usually nearby) pri- 
mary server and fetches updates via background gossip. 

On a PUT, a client first locally stores the update and 
value. As an optimization, rather than initiate the log ex- 
change protocol, a client just sends the update and value 
of each PUT directly to its primary server. If the update 
passes all consistency checks and the value matches the 
hash in the update, the server adds these items to its 
log and checkpoint. Otherwise, the client and server fall 
back on log exchange. Similarly, servers send updates 
and bodies to each other “out of band” as they are re- 
ceived; if two servers detect that they are out of sync, 
they fall back on log exchange. 

On a GET, a client sends the requested lookup key, 
k, to its primary server along with a staleness hint. The 
staleness hint is a set of two-byte digests, one per log- 
ically latest update of k that the client has received via 
background gossip; note that unless there are concurrent 
updates to k, the staleness hint contains one element. If 
the staleness hint matches the latest updates known to 
the server, the server responds with the corresponding 
values. The client then checks that these values corre- 
spond to the H(value) entries in the previously received 
updates. If so, the client returns the values to the application,
completing the GET. If the server rejects the staleness
hint or if the values do not match, then the client
initiates a value and update transfer by sending to its primary
server (a) its version vector and (b) k. The server
replies with (a) the missing updates, which the client verifies
(§4.2), and (b) the most recent set of values for k.


⁴ This approach aids, but does not quite guarantee, persistence of
committed data: “synchronous” disk writes in today’s systems do not
always push data all the way to the disk’s platter [52]. Note that if a
node commits data and subsequently loses it because of an ill-timed
crash, Depot handles that case as it does with any other faulty node.


Depot’s overheads are modest. E.g., for 10KB requests, 99%-tile         §7.1
latency for GET falls from 2.1 ms to 1.6 ms; for PUT it increases
from 14.8 ms to 27.7 ms.

Depot imposes little additional cost for read-mostly workloads.         §7.2
For example, Depot’s weighted dollar cost of 10KB GETs and
PUTs are 2% and 56% higher than the baseline.

Depot continues correct operation when failures occur with little       §7.3
impact on latency or resource consumption.

FIG. 3: Summary of main evaluation results.


Baseline       Clients trust the server to handle their PUTs and
               GETs correctly. Clients neither maintain local state
               nor perform checks on returned data.

B+Hash         Clients attach SHA-256 hashes to the values that
               they PUT and verify these hashes on GETs.

B+H+Sig        Clients sign the values that they PUT and verify
               these signatures on GETs.

B+H+S+Store    The same checks as B+H+Sig, plus clients locally
               store the values that they PUT, for durability and
               availability despite server failures.

FIG. 4: Baseline variants whose costs we compare to Depot’s.

If a client cannot reach its primary server, it randomly 
selects another server (and does likewise if it cannot 
reach that server). If no servers are available, the client 
enters “client-to-client mode” for a configurable length 
of time, during which it gossips with the other clients. In 
this mode, on a PUT, the client responds to the applica- 
tion as soon as the data reaches the local store. On a GET, 
the client fetches the values from the clients that created 
the latest known updates of the desired key. 
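
The staleness-hint check on the GET fast path, described earlier in this section, can be sketched as follows. The text specifies only that the hint is a set of two-byte digests; taking the first two bytes of SHA-256 over a serialized update is an assumption, as are the names `digest2`, `staleness_hint`, and `server_handle_get`.

```python
import hashlib


def digest2(update: bytes) -> bytes:
    # Two-byte digest of one update (assumed construction).
    return hashlib.sha256(update).digest()[:2]


def staleness_hint(latest_updates) -> set:
    # One digest per logically latest update of the key known to the client.
    return {digest2(u) for u in latest_updates}


def server_handle_get(hint: set, server_latest: dict):
    """server_latest maps each latest update of the key to its value."""
    if hint == staleness_hint(server_latest.keys()):
        return list(server_latest.values())    # fast path: values only
    return None                                # client falls back to update transfer
```

A two-byte digest can collide, which is harmless here: the client still checks the returned values against the H(value) entries in the updates it already holds before returning them to the application.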


7 Experimental evaluation 


The principal question that drives our evaluation is: what 
is the “price of distrust?” That is, how much do Depot’s 
guarantees cost, relative to the costs of a baseline storage 
system? We measure latency, network traffic, storage at 
both clients and servers, and CPU cycles consumed at 
both clients and servers (§7.1). We then convert the re- 
source overheads into a common currency [29] using a 
cost model loosely based on the prices charged by to- 
day’s storage and compute services (§7.2). We then move 
from “stick” to “carrot”, illustrating Depot’s end-to-end 
guarantees (§7.3). Figure 3 summarizes our results. 


Method and environment Most of our experiments 
compare our Depot implementation to a set of baseline 
key-value storage systems, described in Figure 4. All of 
them replicate the key-value pairs to a set of servers, using
version vectors to detect precedence, but omit one or
more of Depot’s safeguards. In none of the variants do
clients check version vectors or maintain history hashes.
We have implemented these baseline variants using the
same code base as Depot, so they are not heavily optimized.
For example, as in Depot, the baselines separate
data from metadata, causing writes to two Berkeley DB
tables on every PUT, which is possibly inefficient compared
to a production storage system. Such inefficiencies
may lead to our underestimating Depot’s overhead.

FIG. 5: Latencies ((a) mean and standard deviation and (b) 99th percentile) for GETs and PUTs for various object sizes in Depot
and the four baseline variants. For small- and medium-sized requests, Depot introduces negligible GET latency and sizeable latency
on PUTs, the extra overhead coming from signing, synchronously storing a local copy, and Depot’s additional checks. For large
requests, collision-resistant hashing adds significant latency to both PUTs and GETs.

Our default configuration is as follows. There are 8 
clients and 4 servers with the servers connected in a mesh 
and two clients connecting to each server. Servers gossip 
with each other once per second; a client gossips with its 
primary server every 5 seconds. We experiment with a 
slightly older implementation that runs without receipts 
(§5.3) and beaconing (§5.4). 

Our default workload is as follows. Clients issue a se- 
quence of PUTs and GETs against a volume preloaded 
with 1000 key-value pairs. We partition the write key set 
into several non-overlapping ranges, one for each client. 
As a result, a GET returns a single value, never a set. A 
client chooses write keys randomly from its write key 
range and read keys randomly from the entire volume. 
We fix the key size at 32 bytes. In each run, each client is- 
sues 600 requests at roughly one request per second. We 
examine three different value sizes (3 bytes, 10 KB, and 
1 MB) and the following read-write percentages: 0/100, 
10/90, 50/50, 90/10, and 100/0. (We do not report the
10/90 and 90/10 results; they are consistent with,
and can be predicted by, those from the other workloads.)

We use a local Emulab [70]. All hosts run Linux 
FC 8 (version 2.6.25.14-69) and are Dell PowerEdge 
1200 servers, each with a quad-core Intel Xeon X3220 
2.40 GHz processor, 8 GB of RAM, two 7200RPM local 
disks, and one 1 Gigabit Ethernet port.


7.1 Overhead of Depot 


Latency To evaluate latencies in Depot and the baseline 
systems, we measure from the point of view of the application,
from when it invokes GET or PUT at the local library
until that call returns. Note that for a PUT, the client
commits the PUT locally (if it is a Depot or B+H+S+Store 
client) and only then contacts the server, which replies 
only after committing the PUT. We report means, stan- 
dard deviations, and 99th percentiles, from the GET (i.e., 
100/0) and PUT (i.e., 0/100) workloads. 

Figure 5 depicts the results. For the GET runs, the
differences in means between Baseline and B+Hash are
0.0, 0.2, and 15.2 ms for 3B, 10KB, and 1MB, respectively,
which are explained by our measurements of mean
SHA-256 latencies in the cryptographic library that Depot
uses: 0.1, 0.2, and 15.7 ms for those object sizes.
Similarly, the means of RSA-Verify operations explain
the difference between B+Hash and B+H+Sig for 3B
and 10KB, but not for 1MB; we are still investigating overheads
for that latter case. Note that Depot’s GET latency
is lower than that of the strongest two baselines. The rea- 
son is that Depot clients verify signatures in the back- 
ground, whereas the baseline variants do so on the criti- 
cal path. A key observation is that, for GETs, Depot does 
not introduce much latency beyond applying a collision- 
resistant hash to data stored in an SSP—which prudent 
applications likely do anyway. 

For PUTs, the latency is higher. Each step from 
B+Hash to B+H+S to B+H+S+Store to Depot adds sig- 
nificantly to mean latency, and for large requests, going 
from Baseline to B+Hash does as well. For example, the 
mean latency for 10KB PUTs ascends 3.8 ms, 3.9 ms,
8.5 ms, 9.7 ms, 13.0 ms as we step through the sys-
tems; the 99th-percentile latency goes 14.8 ms, 15.1 ms,
20.4 ms, 37.9 ms, 27.8 ms.

FIG. 6—Resource use of Baseline, B+Hash, B+H+Sig, B+H+S+Store, and Depot. The bar heights represent resource use normal-
ized to Depot, for 10 KB objects and the 100/0 and 0/100 workloads. The labels indicate the actual values. (C) and (S) indicate the
average per-request resource use at clients and servers, respectively. (C-S) and (S-S) are client-server and server-server network
use, respectively. For storage costs (labeled Stor/Ver), we report the cost of storing a version of an object.

We can explain the observed Depot PUT latency with
a simple model. Depot handles PUTs serially, so we sum
the overheads of a PUT’s components (microbenchmark
means for 10KB are in parentheses, with estimates de-
noted ≈): the client hashes the value (0.2 ms), hashes
history (≈ 0.1 ms), signs the update (4.2 ms), stores the
body (2.6 ms, with the DB cache enabled), stores the up-
date (≈ 1.5 ms), and transfers the update and body over
the 1 Gbps network (≈ 0.1 ms); the server verifies the
signature (0.3 ms), hashes the value (0.2 ms), hashes his-
tory (≈ 0.1 ms), and stores the body (2.6 ms) and update
(≈ 1.5 ms). The sum of the means (13.4 ms) is close to
the observed latency of 13.0 ms. The model is similarly
accurate for the 3B experiments but off by 20% for 1MB;
we hypothesize that the divergence stems from queues
that build in front of BDB during periodic log exchange.
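
The serial model above is plain addition, and it can be checked mechanically. The following Python sketch sums the per-step means quoted in the text for a 10KB PUT; the step labels and the function name are illustrative choices, not names from the Depot implementation.

```python
# Microbenchmark means for one 10KB PUT, in milliseconds, as quoted in the text.
# Steps that the text marks as estimates are flagged in the comments.
CLIENT_STEPS_MS = [
    ("hash value", 0.2),
    ("hash history", 0.1),      # estimate
    ("sign update", 4.2),
    ("store body", 2.6),        # with the DB cache enabled
    ("store update", 1.5),      # estimate
    ("network transfer", 0.1),  # estimate, 1 Gbps LAN
]
SERVER_STEPS_MS = [
    ("verify signature", 0.3),
    ("hash value", 0.2),
    ("hash history", 0.1),      # estimate
    ("store body", 2.6),
    ("store update", 1.5),      # estimate
]


def predicted_put_latency_ms(steps):
    """Serial model: the predicted latency is the sum of the step means."""
    return sum(ms for _, ms in steps)


predicted = predicted_put_latency_ms(CLIENT_STEPS_MS + SERVER_STEPS_MS)
observed = 13.0  # mean Depot PUT latency reported for 10KB objects
relative_error = (predicted - observed) / observed

print(f"predicted {predicted:.1f} ms, observed {observed:.1f} ms, "
      f"relative error {relative_error:.1%}")
```

Under these numbers the model overshoots the observed mean by roughly 3%, and the client-side steps (dominated by signing and the two local stores) account for about two thirds of the total.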

These PUT latencies could be reduced. For example, 
we have not exploited obvious pipelining opportunities. 
Also, we experiment on a 1 Gbit/s LAN; in many cloud
storage deployments, WAN delays would dominate la- 
tencies, shrinking Depot’s percentage overhead. 


Resource utilization Figure 6 depicts the overheads of 
various resources in the experiments run above. Depot’s 
overheads are small for GET bandwidth and CPU, for 
server-server bandwidth, and for server storage cost. The 
PUT client-server bandwidth overheads are about 20%. 
The PUT client CPU overheads are substantial due to 
the additional Berkeley DB access and cryptographic 
checks. Client storage overheads are also substantial due 
to the added requirement that clients store data for the 
PUTs that they create and metadata for all PUTs. 


7.2 Dollar cost 


Different resources have different costs. To characterize 
Depot’s overall cost, we convert the measured overheads 
from the prior subsection into dollars. We use the follow- 
ing cost model, loosely based on what customers pay to 
use existing cloud storage and compute resources. 


Client-server network bandwidth      $.10/GB
Server-server network bandwidth      $.01/GB
Disk storage (one client or server)  $.025/GB per month
CPU processing (client or server)    $.10 per hour


FIG. 7—Dollar cost to GET 1TB of data, PUT 1TB of data, or
store 1TB of data for 1 month. Each object has a small key
and a 10KB value. 1TB of PUTs or GETs corresponds to 10^8
operations, and 1TB of storage corresponds to 10^8 objects.

Figure 7 shows the overheads from Figure 6 weighted
by these costs. Depot’s overheads are modest for read-
mostly workloads. Depot’s GET costs are only slightly
higher than Baseline’s: $108.10 v. $106.50 for 10^8 GET
operations on 10KB objects. However, Depot’s PUT costs
are over 50% higher: $234.40 v. $150.50 for 10^8 opera-
tions on 10KB objects. Most of the extra cost is from dis-
tributing and verifying metadata across all nodes, so the
relative overheads would fall for larger objects. Depot’s
storage costs are 31% higher than Baseline’s: $138.50 v.
$105.50 to store 10^8 10KB objects for a month. Most of
the extra cost is from storing a copy of each object at the
issuing client; the rest is from storing metadata.
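
The relative overheads quoted above follow directly from the reported dollar figures. The Python sketch below recomputes them and the operation count for 1TB of 10KB objects; it assumes decimal units (1TB = 10^12 bytes, 10KB = 10^4 bytes), and the dictionary keys and function names are illustrative.

```python
# Dollar figures quoted in the text, as (Baseline, Depot) pairs.
REPORTED_COST_USD = {
    "GET 1TB": (106.50, 108.10),
    "PUT 1TB": (150.50, 234.40),
    "store 1TB for a month": (105.50, 138.50),
}


def operations_per_terabyte(object_bytes=10**4, total_bytes=10**12):
    """Number of 10KB objects in 1TB, assuming decimal units."""
    return total_bytes // object_bytes


def overhead_percent(baseline, depot):
    """Depot's extra cost relative to Baseline, as a percentage."""
    return 100.0 * (depot - baseline) / baseline


OVERHEADS = {
    name: overhead_percent(baseline, depot)
    for name, (baseline, depot) in REPORTED_COST_USD.items()
}

print(f"operations per TB: {operations_per_terabyte():.0e}")
for name, pct in OVERHEADS.items():
    print(f"{name}: Depot costs {pct:.1f}% more than Baseline")
```

The GET overhead comes out below 2%, the PUT overhead above 50%, and the storage overhead at about 31%, matching the characterization in the text.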


7.3 Experiments with faults


We now examine Depot’s behavior when servers become 
unavailable and when clients create forking writes. 


Server unavailability In this experiment, 8 clients ac- 
cess 8 objects on 4 servers. The objects are 10KB, and
the workload is 50/50 GET/PUT. Servers gossip with ran- 
dom servers every second, and clients gossip with their 
chosen partner (initially a server) every 5 seconds. 300 
seconds into the experiment, we stop all servers. By post- 
processing logs, we measure the staleness of GET results, 
compared to instantaneous propagation of all updates: 
the staleness of a GET’s result is the time since that re- 
sult was overwritten by a later PUT. If the GET returns 
the most recent update, the staleness is 0. 
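
As an illustration of this definition, the following Python sketch computes the staleness of a single GET from post-processed logs. It assumes that all timestamps come from one common clock and that the PUTs to an object are totally ordered by time; concurrent or forked writes are outside its scope, and the function and parameter names are hypothetical.

```python
from bisect import bisect_right


def staleness(get_time, returned_put_time, put_times):
    """Staleness of one GET result, in the same time unit as the inputs.

    put_times: sorted times of all PUTs to the object, taken from the logs.
    returned_put_time: time of the PUT whose value the GET returned.
    """
    # Index of the first PUT strictly later than the one that was returned.
    i = bisect_right(put_times, returned_put_time)
    # No later PUT existed when the GET ran: the result was the most recent.
    if i == len(put_times) or put_times[i] > get_time:
        return 0.0
    # Otherwise the result has been stale since it was first overwritten.
    return float(get_time - put_times[i])
```

For example, with PUTs at times 10, 20, and 30, a GET at time 35 that returns the value written at time 10 has staleness 15, because that value was overwritten at time 20.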

FIG. 8—The effect of total server failure (t = 300) on (a) staleness and (b) latency. The workload is 50/50 R/W and 10KB objects.
For space, we omit the graph of PUT latency for this experiment. Depot maintains availability through client-to-client transfers
whereas the baseline system blocks, and GET latency actually improves (at the expense of staleness).

FIG. 9—GET latency seen by a correct client in three runs:
8 correct clients (8C0F), 6 correct clients and 2 faulty clients
(6C2F), and 6 correct clients (6C0F). The results for PUT la-
tency are not depicted but are the same: Depot survives forks
without affecting client-perceived latency.


Figure 8(a) depicts the staleness observed at one 
client. Before the servers fail, GETs in both Depot 
and B+H+S+Store have low staleness. After the failure, 
B+H+S+Store blocks forever. Depot, however, switches 
to client-to-client mode, continuing to service requests. 
Staleness increases noticeably both because it takes more 
network hops to disseminate updates and because the 
lower gossip frequency increases the delay between 
hops. 

Figure 8(b) depicts the latency of GETs observed by 
the same client. Prior to the failure, Depot’s GET latency 
is significantly higher than measured in the experiments 
in §7.1 because the workload here has just 8 objects, each 
of which is updated every 2 seconds, so the optimization 
described in §6 often fails, forcing the client and server 
to perform a log exchange before the GET can complete. 
When the servers fail, Depot continues to function, and 
GET latency actually improves: rather than requesting 
“the current” value from the server (and then completing 
a log exchange to get the new metadata required to vali- 
date the newest update), in client-to-client mode, a client 
fetches the specific version mentioned in the update it 
already has from the writer. Though not depicted, De- 
pot’s PUT latency also improves in client-to-client mode: 
PUT operations return as soon as the update and value are 
stored locally, with no round trip to a server. 


Client fork In this experiment, we compare three runs:
8 correct clients (8C0F), 6 correct clients and 2 faulty
clients (6C2F), and 6 correct clients (6C0F). In each run,
the clients access 1000 objects on 4 servers. The ob-
jects are 10KB, and the workload is 50/50 GET/PUT. 300
seconds into the experiment, faulty clients begin to issue 
forking writes. When a correct client observes a fork, 
it creates and publishes a proof of misbehavior (POM) 
against the faulty client, and when servers or other clients 
receive the POM, they stop accepting new writes directly 
from the faulty client. 

Figure 9 depicts the results for GETs. Forks introduced 
by faulty clients do not have an obvious effect on GET or
PUT latency; note that the spikes in GET latency prior to 
t = 300 are unrelated to client failures. We also measured 
CPU consumption and found no interesting differences 
among the intervals before the failures, at the time of the 
failures, or after the faulty nodes had been evicted. 


8 Teapot for legacy SSPs 


Depot runs on both clients and SSP nodes, but it would 
be desirable to provide Depot’s guarantees using unmod- 
ified legacy SSPs such as S3, Azure Storage, or Google 
Storage. Intuitively, such an approach appears possible. 
In Depot, servers must (1) propagate updates among 
clients and (2) provide update bodies (i.e., values) in re- 
sponse to GET requests. We should be able to use an 
SSP’s abstract key-value map as a communication chan- 
nel and as storage for update bodies. And because Depot 
clients verify everything that they receive from servers, 
we should still be able to provide most of the properties 
discussed in §5. In this section, we give a brief overview
of Teapot, a variation of Depot that uses legacy SSPs.

FIG. 10—Average latencies (with standard deviations) per-
ceived by Teapot for GET and PUT operations with 10KB pay-
load when using Amazon S3 for storage.

Teapot assumes an API like that of S3: LPUT(k, v, b)
(associate v with k in a bucket b owned by a given client)
and LGET(k, b) (return v). On a PUT, the Teapot client
creates and locally stores the metadata u (a Depot up-
date) and the data d (a Depot value). The client then
stores both to the SSP by calling LPUT(H(u), u, b_c) and
LPUT(H(d), d, b_c), where b_c is a bucket that only c can
write. The client then identifies its latest update by stor-
ing it to a distinguished key, k* (that is, the client exe-
cutes LPUT(k*, u, b_c)). In the background, the client peri-
odically fetches the other clients’ latest updates by read-
ing their k* entries and then fetching and validating the
updates’ dependencies. On a GET, the Teapot client uses
LGET to retrieve the value(s) associated with the latest
update(s) that it has received.
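
The PUT and GET paths just described can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the Teapot prototype: the SSP is an in-memory dictionary rather than S3, H is SHA-256, the update is a small JSON record of our own invention, and signatures, history hashes, and dependency validation are all omitted. The class and method names are hypothetical.

```python
import hashlib
import json

LATEST_KEY = "k*"  # distinguished key that names a client's latest update


def H(data: bytes) -> str:
    """Collision-resistant hash used to name updates and values."""
    return hashlib.sha256(data).hexdigest()


class LegacySSP:
    """In-memory stand-in for an S3-like store: bucket -> key -> value."""

    def __init__(self):
        self.buckets = {}

    def lput(self, k, v, b):
        self.buckets.setdefault(b, {})[k] = v

    def lget(self, k, b):
        return self.buckets[b][k]


class TeapotClient:
    def __init__(self, client_id, ssp):
        self.bucket = "bucket-" + client_id  # only this client writes here
        self.ssp = ssp
        self.local = {}   # local copies of the client's own updates and values
        self.latest = {}  # object key -> (writer's bucket, hash of the value)

    def put(self, key, value: bytes):
        d = value
        # Hypothetical update format; a real Depot update is signed and
        # carries history information that this sketch leaves out.
        u = json.dumps({"key": key, "value_hash": H(d)}, sort_keys=True).encode()
        self.local[H(u)], self.local[H(d)] = u, d
        self.ssp.lput(H(u), u, self.bucket)
        self.ssp.lput(H(d), d, self.bucket)
        self.ssp.lput(LATEST_KEY, u, self.bucket)
        self._accept(u, self.bucket)

    def poll(self, other_bucket):
        """Background step: read another client's k* entry."""
        self._accept(self.ssp.lget(LATEST_KEY, other_bucket), other_bucket)

    def _accept(self, u, bucket):
        meta = json.loads(u)
        self.latest[meta["key"]] = (bucket, meta["value_hash"])

    def get(self, key):
        bucket, value_hash = self.latest[key]
        d = self.ssp.lget(value_hash, bucket)
        # The client trusts the hash in the update, not the SSP's reply.
        if H(d) != value_hash:
            raise ValueError("SSP returned a value that does not match the update")
        return d
```

Because values are stored under their own hashes, a client can detect an SSP that substitutes a different value: the GET fails the hash check instead of returning corrupted data.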

We have prototyped Teapot using S3 and a variation on 
the arrangement just sketched. As shown in Figure 10, 
accessing S3 through Teapot rather than through LPUT
and LGET introduces little latency over S3; the baseline 
latencies to S3 are already scores of milliseconds, so the 
additional overheads are small. The other resource costs 
(client-side storage, extra bandwidth, etc.) are similar to 
those of Depot (§7.1). 


Discussion Teapot differs from Depot in two important 
ways. First, if a client fails in particular ways, Teapot 
cannot guarantee valid discard (§5.6). A client c can, 
for example, issue a PUT, allow the update to be ob- 
served by other clients, and then delete the value associ- 
ated with the update. Second, Teapot servers cannot pro- 
vide the durability receipts that Depot clients use to avoid 
depending on insufficiently-replicated data (§5.3). Note 
that Teapot tolerates arbitrary SSP failures and many 
other client failures (crashes, forks, etc.), so Teapot’s ad- 
ditional vulnerability over Depot is limited and may be 
justified by its deployability. 

We now ask: what incremental extensions to SSPs 
would allow us to run code only on clients but recover 
Depot’s full guarantees? We speculate that the following 
suffices. First, to allow a correct client to avoid depend- 
ing on updates that a faulty client could delete, the SSP 
could implement LINK(K, b_c, b_c'), UNLINK(k, b_c, b_c'),
and VERIFY(k, H, b_c). LINK causes every existing or new
key/value pair in a keyrange K in one client’s bucket
(b_c) to be linked to another client’s bucket (b_c'), where
a key/value pair linked to another bucket may not be
modified or deleted. UNLINK removes such a link. VER-
IFY checks that the SSP stores a value with hash H
for key k in bucket b_c. Then, if a client links to other
clients’ buckets when it joins the system and it veri-
fies an update’s value before accepting the update into
its history, we can effectively restore unanimous consent 
for garbage collecting versions (§5.6). Second, to assure 
clients that updates are sufficiently replicated, the SSP 
could return a receipt in response to LPUT that the clients 
could use like receipt sets in standard Depot (§5.3). 
These extensions seem plausible. Others have proposed 
receipts [38, 58, 62, 74], and the proposed LINK and UN- 
LINK calls have correlates on Unix file systems, suggest- 
ing utility beyond Teapot. 
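
To make the intended semantics of the proposed calls concrete, here is a speculative Python model of an SSP that supports them. Everything beyond the three call names is an assumption made for illustration: a keyrange is modeled as an inclusive (lo, hi) pair of strings, unlink takes the same keyrange that was linked, the hash is SHA-256, and ldelete is a hypothetical delete call standing in for whatever deletion the SSP offers. Receipts are not modeled.

```python
import hashlib


class LinkingSSP:
    """Toy key-value SSP extended with LINK, UNLINK, and VERIFY."""

    def __init__(self):
        self.buckets = {}   # bucket -> {key: value}
        self.links = set()  # (lo, hi, owner_bucket, linked_bucket)

    def _is_linked(self, k, b):
        return any(lo <= k <= hi and owner == b
                   for lo, hi, owner, _ in self.links)

    def lput(self, k, v, b):
        bucket = self.buckets.setdefault(b, {})
        # New pairs may be created in a linked range; existing ones are frozen.
        if k in bucket and self._is_linked(k, b):
            raise PermissionError("linked key/value pair may not be modified")
        bucket[k] = v

    def ldelete(self, k, b):
        if self._is_linked(k, b):
            raise PermissionError("linked key/value pair may not be deleted")
        self.buckets.get(b, {}).pop(k, None)

    def link(self, key_range, b, b_other):
        lo, hi = key_range
        self.links.add((lo, hi, b, b_other))

    def unlink(self, key_range, b, b_other):
        lo, hi = key_range
        self.links.discard((lo, hi, b, b_other))

    def verify(self, k, h, b):
        """True if the SSP stores a value with hash h for key k in bucket b."""
        v = self.buckets.get(b, {}).get(k)
        return v is not None and hashlib.sha256(v).hexdigest() == h
```

In this model, a value in a linked range survives until every client that linked to it has unlinked, which is the behavior needed to approximate unanimous consent for garbage collection.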

This discussion illustrates that clients can use an SSP- 
supplied key-value map as a black box to recover most of 
Depot’s properties. To recover all of them, the SSP needs 
to be incrementally augmented not to delete prematurely. 


9 Related work 


We organize prior work in terms of trade-offs between 
availability and fault-tolerance. 

Restricted fault-tolerance, high availability. A num- 
ber of systems provide high availability but do not tol- 
erate arbitrary faults. For example, key-value stores in 
clouds [16, 21, 22] take a pragmatic approach, using 
system structure and relaxed semantics to provide high 
availability. Also, systems like Bayou [67], Ficus [61], 
PRACTI [10], and Cimbiosys [60] can get high avail- 
ability by replicating all data to all nodes. Unlike Depot, 
none of these systems tolerates arbitrary failures. 

Medium fault-tolerance, medium availability. An- 
other class of systems provides safety even when only 
a subset (for example, 2/3 of the nodes) is correct. How- 
ever, the price for this increased fault tolerance compared 
to the prior category is decreased liveness and availabil- 
ity: to complete, an operation must reach a quorum of 
nodes. Such systems include Byzantine-Fault Tolerant 
(BFT) replicated state machines (see [15, 19, 30, 33]) 
and Byzantine Quorums [46]. Note that researchers are 
keenly interested in reducing trust: compared to clas- 
sic BFT systems, the recently proposed A2M [17], 
TrInc [42], and BFT2F [44] all tolerate more failures, the 
former two by assuming trusted hardware and the latter 
by weakening guarantees. However, unlike Depot, these 
systems still have fault thresholds, and none works dis- 
connectedly. PeerReview [31] requires a quorum of wit- 
nesses with complete information (hindering liveness), 
one of which must be correct (a trust requirement that 
Depot does not have). 

High fault-tolerance, low availability. In fork-based 
systems, such as SUNDR [43] and FAUST [12], the 
server is totally untrusted, yet even under faults provides 
a safety guarantee: fork-linearizability, fork-sequential 
consistency, etc. [54]. However, these systems provide 
reduced liveness and availability compared to Depot. 
First, in benign runs, their admittedly stronger seman-
tics means that they cannot be available during a network
partition or server failure. Second, after a fork, nodes are 
“stranded” and cannot talk to each other, effectively stop- 
ping the system. A related strand of work focuses on ac- 
countability and auditing (see [38, 58, 62, 74]), providing 
proofs to participants if other participants misbehave. All 
of these systems detect misbehavior, whereas our aim is 
to tolerate and recover from it—which we view as a re- 
quirement for availability. 

Systems with similar motivations. Venus [63] allows 
clients not to trust a cloud storage service. While Venus 
provides consistency semantics stronger than Depot’s 
(causal consistency for pending operations, lineariz- 
ability for completed operations (roughly)), it makes 
stronger assumptions than Depot. Specifically, Venus re- 
lies on an untrusted verifier in the cloud; assumes that a 
core set of clients does not permanently go offline; and 
does not handle faulty clients, such as clients that split 
history. SPORC [24] is designed for clients to use a sin- 
gle untrusted server to order their operations on a sin- 
gle shared document and provides causal consistency for 
pending operations (and stronger for committed opera- 
tions). Unlike Depot, SPORC does not consider faulty 
clients, allow clients to talk to any server, or support arbi- 
trary failover patterns. However, SPORC provides innate 
support for confidentiality and access control, whereas 
Depot layers those on top of the core mechanism. 

A number of other systems have sought to minimize 
trust for safety and liveness. However, they have not 
given a correctness guarantee under arbitrary faults. For 
example, Zeno [64] does not operate with maximum live-
ness or minimal trust assumptions: it assumes f + 1 avail-
able servers per partition, where f is the number of faulty
servers. TimeWeave [47] ensures that correct nodes can 
pass the blame of any mal-activity to culprit nodes, and 
S2D2 [35] uses tamper-evident history summaries to de- 
tect forks. However, unlike Depot, these two systems 
neither repair forks nor target cloud storage (which re- 
quires addressing staleness, durability, and recoverabil- 
ity). Other systems target scenarios similar to cloud stor- 
age but do not protect consistency [28, 34, 65]. 

Some systems have, like Depot, been designed to re- 
sist large-scale correlated failures. Glacier [32] can tol- 
erate a high threshold, but still no more than this thresh- 
old, of faulty nodes, and it stores only immutable objects. 
OceanStore [39] is designed to minimize trust for dura- 
bility but does not tolerate nodes that fail perniciously. 

Distributed revision control. Distributed repositories 
like Git [27], Mercurial [49], and Pastwatch [73] in- 
corporate a data model similar to Depot’s, and could 
be augmented to resist faulty nodes (for example, forc- 
ing clients to sign updates in Git would prevent servers 
from undetectably altering history). However, all of these
systems are fundamentally geared toward replicating a
source code repository. Our context brings concerns that 
these systems do not address, including how to avoid 
clients’ storing all data, how to perform update exchange 
in this scenario, how to provide freshness, how to evict 
faulty nodes, how to garbage collect, etc. 


10 Conclusion 


Depot began with an attempt to explore a radical point 
in the design space for cloud storage: trust no one. Ulti- 
mately we fell short of that goal: unless all nodes store 
a full copy of the data, then nodes must rely on one an- 
other for durability and availability. Nonetheless, we be- 
lieve that Depot significantly expands the boundary of 
the possible by demonstrating how to build a storage sys- 
tem that eliminates trust assumptions for safety and min- 
imizes trust assumptions for liveness. 


Abstract


Distributed key-value storage systems are widely used in 
corporations and across the Internet. Our research seeks to 
greatly expand the application space for key-value storage sys- 
tems through application-specific customization. We designed 
and implemented Comet, an extensible, distributed key-value 
store. Each Comet node stores a collection of active storage 
objects (ASOs) that consist of a key, a value, and a set of han- 
dlers. Comet handlers run as a result of timers or storage oper- 
ations, such as get or put, allowing an ASO to take dynamic, 
application-specific actions to customize its behavior. Handlers 
are written in a simple sandboxed extension language, provid- 
ing properties of safety and isolation. 

We implemented a Comet prototype for the Vuze DHT, de- 
ployed Comet nodes on Vuze from PlanetLab, and built and 
evaluated over a dozen Comet applications. Our experience 
demonstrates that simple, safe, and restricted extensibility can 
significantly increase the power and range of applications that 
can run on distributed active storage systems. This approach fa- 
cilitates the sharing of a single storage system by applications 
with diverse needs, allowing them to reap the consolidation ben- 
efits inherent in today’s massive clouds. 


1 Introduction 


The last decade has seen the rise of distributed stor- 
age systems built on loosely coupled collections of au- 
tonomous computers. For example, Amazon’s S3 [3] 
provides a key-value storage service for external Web 
clients. Amazon’s Dynamo [17], Apache Cassandra [5], 
and Project Voldemort [38] provide reliable and scalable 
key-value stores for company-internal applications (for 
Amazon, Facebook, and LinkedIn, respectively). On the 
global Internet, DHTs provided by BitTorrent-based sys- 
tems, such as Vuze [58] and uTorrent [56], store metadata 
for millions of clients using peer-to-peer file-sharing ap- 
plications. And finally, researchers have developed com- 
plete file systems on top of untrusted clients in widely 
distributed P2P environments [2, 14, 44]. 

Distributed storage systems offer many advantages 
over their centralized counterparts. For example, a de- 
centralized structure supports scalability; the lack of cen- 
tralized management enhances automatic load balancing; 
and the use of replication in a highly distributed environ- 
ment can improve reliability and data availability. We 
therefore expect Dynamo-like storage systems to become 
commonplace as generic application infrastructures in the 
future, both inside of the enterprise and as shared services 
on the Internet. 


A significant limitation of such systems for generic 
application support, however, is that different applica- 
tions have different needs. As an example, each Dynamo 
application inside of Amazon runs its own Dynamo in- 
stance [17], even though a single instance might be log- 
ically better and more resource efficient. In our own 
work on Vanish [25] — a security-oriented DHT applica- 
tion — we needed to make application-specific parame- 
ter and policy changes to Vuze (a million-node commer- 
cial DHT) in order to harden it against attack. While 
these changes were conceptually simple, e.g., modi- 
fying the storage replication algorithm, deploying our 
changes took months of work with Vuze’s DHT designer. 
Other Vuze applications may wish to make their own 
application-specific changes or enhancements, but doing 
so is neither feasible nor supportable, and it doesn’t scale. 
We believe that with the huge consolidation benefits of 
shared cloud storage services, either inside or outside of 
the enterprise, supporting specialization of storage ser- 
vices can have high payoffs in the future. 


This paper presents Comet, a next-generation, flexi- 
ble, distributed storage system, which opens the world 
of distributed storage to a new set of more complex stor- 
age applications. In particular, Comet permits multiple 
applications to share a single Comet instance, while en- 
abling each application to change the behavior of its stor- 
age elements to suit its own requirements. For example, 
a storage element can make decisions based on its access 
history, its current number of replicas, the time of day, 
etc. Therefore Comet can easily support different stor- 
age lifetimes, access methods, access control schemes, or 
replication schemes for different storage-element types, 
in a way that makes them easy to deploy and test. Using 
Comet, we can also carry out interesting measurement- 
based experiments from within the DHT. 


Comet implements active storage objects (ASOs). An 
active storage object consists of a key, an associated value 
(an untyped blob), and optionally, a set of simple han- 
dlers. An ASO’s handlers execute as a result of com- 
mon storage events on the object (such as get and put) 
or from timer events that its handlers request. As a result, 
an ASO can modify its environment, monitor its execu- 
tion, and make dynamic decisions about its state. 


The design of an extensible system for this environ- 
ment presents a set of interesting design questions. For 
example, what features should the system provide for ap-
plications and which can (and should) be left out? What
is the proper tradeoff between power and safety? How 
can client nodes be confident that active storage objects 
will not cause damage or interference? How can we pre- 
vent the use of active storage objects to mount a DDoS at- 
tack? And overall, how can we extend the storage system 
without losing its principal characteristics? Our Comet 
design considers these and other issues. 

The remainder of this paper describes our goals, ar- 
chitecture, experience, and evaluation of Comet. To pro- 
vide concrete insight into Comet’s design and potential, 
we implemented a Comet prototype and used it to cre- 
ate and deploy a set of over a dozen Comet applications. 
Our prototype leverages Vuze: each Comet instance is 
an extended Vuze client that can execute Comet active 
storage objects while also serving as a full participant 
in the million-node Vuze DHT. Comet applications are 
written in Lua — a common application-extension lan- 
guage. We modified the Lua runtime to meet our iso- 
lation and safety requirements, providing a safe sandbox 
for handler execution. To test our applications we ran our 
Comet clients from several hundred PlanetLab nodes and 
measured their behavior. Overall, our experience demon- 
strates that a highly restrictive but active distributed stor- 
age system can provide significant power to simultane- 
ously support applications with diverse storage needs. 


2 Related Work 


The concept of extensible systems has been widely ex- 
plored in the past in several domains. Extensible operat- 
ing systems have been proposed that support application- 
specific needs [6, 46, 28]. Active networks allow code 
to be downloaded along with network data and executed 
within the network infrastructure (e.g., on routers) to ex- 
tend network services [60, 54]. Active messages execute 
a small amount of user code with each message recep- 
tion [57]. Click explored the design of an extensible 
router [30]. Database triggers allow applications to define 
procedural code that is executed in response to database 
operations [35]. 

In the context of storage systems, Watchdogs [7] ex- 
tends the Unix file system, allowing a user-mode process 
to interpose on file operations for specific files to change 
access semantics. Several projects have proposed the 
integration of CPUs and disks to create intelligent disk 
storage systems that can provide on-board application- 
specific functions, e.g., for decision support systems, data 
mining, and image processing [29, 41, 1]. 

DHTs are increasingly used to support a variety of dis- 
tributed applications, such as file-sharing, distributed re- 
source tracking, end-system multicast, publish-subscribe 
systems, distributed search engines, and even data-center 
applications. Some of these systems (e.g., CFS [14],
i3 [52], and PAST [44]) can be implemented using the
traditional put/get interface, but many others (e.g., Mer-
cury [8], CoralCDN [21], Scribe [45], and Bayeux [64])
require customized interfaces and are implemented by
altering the underlying DHT mechanisms in significant
ways. Our work provides the ability to extend a DHT
without requiring a substantial investment of effort to
modify its implementation.


9th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’10)

Deployed DHTs don’t currently offer good semantics 
and security. However, people do know how to make 
them consistent [32, 34] and harden them against at-
tacks [11, 16, 48, 26, 59]. The reason DHTs do not cur- 
rently implement these techniques is that there has not yet 
been a deployed application that truly needed strong se- 
mantics and security. For example, the Vuze design per- 
ceived many threats as irrelevant [23] and deployed few 
defenses against them. However, after the new, more de- 
manding Vanish application was proposed [25], the Vuze 
DHT responded by embracing a variety of effective secu- 
rity measures. In addition to enabling new applications 
atop DHTs, we hope to drive the design of these systems 
towards well-understood, yet unadopted levels of security 
and consistency. 


3 Goals 


Comet is a distributed key-value storage system. Like 
other such systems, a Comet storage object is a 
<key,value> pair. Unlike previous systems, however, 
Comet’s design facilitates extensible, active storage ob- 
jects. A Comet application performing a put can there- 
fore include, along with a key and value, a small set of 
handlers for that object. The node receiving the put 
stores the handlers along with the key and value, registers 
the handlers for events that they specify, and executes the 
handlers when their respective events occur. 
Comet’s system goals are: 


1. Flexibility. Comet should be easily customizable to 
achieve our target functions described below. 


2. Isolation and safety. A client node running Comet 
should be protected from the execution of handlers 
(e.g., an executing handler cannot corrupt the node or 
use unlimited resources). Handlers should not be able 
to mount messaging attacks on other nodes. 


3. Performance. The performance of gets/puts on a
Comet ASO with null handlers should be the same 
as on a non-active system, and execution of handlers 
should have only negligible performance impact. 


Isolation and safety are particularly important to our 
architecture. While Comet can be used in different envi- 
ronments, we designed it to enable wide-scale, outside- 
the-firewall deployment on autonomous nodes, similar to 
P2P systems and DHTs. Users downloading Comet must 
trust it and have guarantees about its behavior. For this 
reason, Comet enforces four important restrictions: 


USENIX Association 


1. Limited knowledge: an ASO is not aware of other ob- 
jects or resources stored on the same node and has no 
direct way to learn about them. 


2. Limited access: an object handler can manipulate 
only its own value and cannot modify the values of 
other objects on its storage node. 


3. Limited communication: an active storage object can- 
not send arbitrary messages over the network. 


4. Limited resource consumption: an ASO’s resource 
usage is strictly bounded, e.g., the system limits the 
amount of computation and memory it can consume. 


We are specifically not attempting to build a general- 
purpose distributed programming system, such as Planet- 
Lab [4, 36]; such a system would be unacceptable in our 
target environment and inappropriate (and unnecessary) 
for our needs. Rather, our goal is to support relatively 
simple specializations or actions on simple storage ob- 
jects. Even very simple specializations can provide a sig- 
nificantly more powerful storage system that enables new 
types of applications. We therefore take a lightweight and 
limited approach. As examples, an ASO should be able 
to perform the following functions: 


• Statistics gathering. Collect statistics about its use,
e.g., by counting the number of gets and puts.


• Information tracking. Log information, such as a list
of IPs that performed get operations on its value or a
recent history of the values it stored.


• Time awareness. Take time-based actions, e.g., to
make periodic changes to its state or self-destruct af-
ter a timer has elapsed.


• Location awareness. Make location-based decisions,
e.g., choosing where to store based on nodes’ network
locations.


• Access control. Implement simple access control
policies on its own.


• Replication. Implement different replication policies.


• Storage system measurement. Provide insight into the
behavior of the distributed storage system as seen by
clients executing within the system itself.


As we shall see, the only long-term state available to a 
handler is its object’s value; therefore, any logs, counts, 
etc., must be stored as part of that value. However, an ac- 
tive object can choose to report only a subset of its stored 
value record on a get, or it can selectively report different 
values to different callers based on call parameters. 


The following sections describe Comet’s architecture. 
In particular, we discuss the tradeoffs required to provide 
flexibility while also achieving isolation and safety. 


4 Architecture and Implementation 


This section describes Comet’s active storage architec- 
ture and prototype implementation. One could imagine 
running Comet in various environments, e.g., an inside- 
the-firewall corporate deployment or a distributed envi- 
ronment with autonomous untrusted nodes. We focus our 
current architecture and prototype on the latter. 


4.1 Architecture 


Figure 1(a) shows the high-level architecture of our 
Comet distributed storage system. The Comet storage 
system consists of three basic components. First is the 
routing substrate (Figure 1(a) bottom), which imple- 
ments the value/node mapping, allowing a client to find 
nodes that store specific data items. In the case of a DHT, 
for example, the routing substrate typically applies a hash 
function to the key to compute the IDs of nodes that store 
the associated value. However, other routing substrates 
may locate values in other ways. 

The second component is the key-value store, which 
maintains a set of key-value pairs on each node. A key- 
value storage system typically exports a simple get /put 
interface. While existing storage systems store arbitrary, 
untyped byte strings, the Comet storage system stores ac- 
tive storage objects (ASOs). An ASO consists of a key 
and its associated state (i.e., a value, stored as an untyped 
byte string), along with optional code that operates on 
that state. The code is structured as a set of handlers that 
specify how the object behaves, i.e., how it modifies its 
state when certain events occur. For example, an ASO’s 
onGet handler is invoked whenever a remote client per- 
forms a get operation to access an object. This handler 
might perform some simple operation, such as increment- 
ing a counter for the number of gets or appending the 
client’s IP address to a log structure. The counter or the 
log structure would be stored as part of the ASO’s state 
that can be accessed by the handler. 
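
The counter and log described above might look as follows. This is only a sketch: the field names and the `MAX_LOG` cap are assumptions introduced here so that the stored object stays small, and the caller is represented as a plain IP string.

```lua
local MAX_LOG = 3  -- assumed cap on the number of logged callers

-- Hypothetical ASO whose state holds a value, a get counter, and an IP log.
local aso = { value = "payload", gets = 0, log = {} }

function aso:onGet(caller)
  self.gets = self.gets + 1
  table.insert(self.log, caller)
  if #self.log > MAX_LOG then
    table.remove(self.log, 1)    -- drop the oldest entry
  end
  return self.value
end
```

Because the counter and the log are fields of the object itself, they persist with the value and travel with it when the object is replicated.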

The third architectural component is the active runtime 
system. The runtime system handles ASO invocations 
and provides the security policy and execution environ- 
ment. An application running on a remote client specifies 
the initial state and handlers for an ASO when initially 
storing the object via a put operation. When a client per- 
forms a get or a put, it can optionally request a cryp- 
tographic checksum of the code associated with the tar- 
get ASO. This can serve as an integrity check that the 
client’s initial put is to a key with no associated ASO 
and that subsequent operations are performed on ASOs 
created by the application. In most implementations, a Comet node distrusts remote nodes and client applications; therefore, the runtime component of the active subsystem implements and enforces an ASO execution sandbox (Figure 1(a), top). Our Comet prototype uses a language sandbox based on Lua [43] to prevent a handler from accessing outside state and to constrain the ASO from consuming too many computational and memory resources on the host. The ASO runtime consults a security policy module, which specifies all execution limits.


(a) Architecture.

(b) ASO API:
  get(key, [args]): value, nodes storing copies
  put(key, value[, nodes])
  lookup(key): nodes closest to a key

Figure 1: Comet Architecture and APIs. (a) depicts the decomposition of a Comet node into two vertical components: the core Comet code, which is trusted from the node’s perspective, and the ASO code, which is arbitrary and, therefore, untrusted. (b) details the API exposed to ASOs.


While some applications may be satisfied by an en- 
tirely sandboxed execution, many would benefit from an 
ASO’s limited ability to interact with or “sense” its en- 
vironment. For example, to implement the conditional 
replication scheme we added to Vuze for Vanish, an ASO 
requires knowledge of the number of replicas in the DHT 
and the time of day (to enforce the desired minimum 
replication interval). For this reason, the active subsys- 
tem exposes a small API (called the ASO API) to the 
handlers. 


4.2 Active Storage Object API 


Table 1 and Figure 1(b) show the handler and ASO run- 
time APIs, respectively. The handler API supports invo- 
cations based on the primary storage functions — put, get 
— as well as an onTimer handler to be executed period- 
ically (e.g., once every 10 minutes) during the object’s 
lifetime. For example, an ASO could directly implement 
a custom replication policy in its onTimer handler. 

The ASO runtime API is the only way for an ASO 
to interact with its environment outside of the sandbox. 
Our design supports two types of useful interactions: (1) 
obtaining information about the local node, and (2) ex- 
ecuting various storage system operations. The former 
category includes functions to obtain the time of day, the 
hosting machine’s external IP address, etc. The latter in- 
cludes functions to interact with other storage system ob- 
jects. The ASO API was not designed to be entirely gen- 
eral; rather, our goal was to provide a minimal interface, 
informed in part by our requirements of security, privacy, 
and isolation. We tested this interface by implementing 
and running over a dozen applications on our Comet prototype. Interestingly, we were able to build a relatively
diverse set of applications with a surprisingly small in- 
terface, which has remained relatively stable through the 
project. This suggests that a small interface, like the one 
shown in Figure 1(b), can support a wide variety of appli- 
cations. Naturally, there are limitations. For example, we 
explicitly prohibit any direct network-level interactions 
with remote nodes on the Internet. While this feature 
might be desirable to certain measurement applications, 
its DDoS implications would be unacceptable. 





| Handler | Description |
| onGet(caller[, callbackID, payload]) | Invoked when a get is performed on the ASO. Returns a value which will be passed back to the caller. Instead of returning a value immediately, the handler could also perform a put at the optional callbackID sometime in the future. The handler also takes an optional payload argument of arbitrary type. |
| onPut(caller) | Invoked upon initial put when the object is created. Returns the value that should be stored by the node (e.g., itself or nil). |
| onUpdate(new_value, caller) | Invoked on an ASO when a put overwrites an existing value. Returns the value that should be stored, e.g., new_value if it should be replaced, or itself if not. |
| onTimer() | Invoked periodically. This handler has no return value. It is used to perform periodic maintenance such as replication. |

Table 1: ASO Handler Calls.


4.3 Language Based Sandbox 


Our Comet prototype focuses on a DHT environment 
composed of a large number of untrusted autonomous 
nodes that cooperate to support the distributed active stor- 
age system. In this environment, the key challenges in- 
clude providing a strong sandbox and limiting ASO re- 
source consumption. We briefly describe how our system 
addresses these challenges using a language based sand- 
box. 
The Comet prototype required an ASO programming
environment that reflected our needs for simple extensi- 
bility, flexibility, performance, isolation, and safety. To 
meet these needs, we chose Lua [43], a lightweight and 
easily constrained scripting language. A dynamically 
typed, imperative and functional programming language, 
Lua is most commonly used for coding application ex- 
tensions. In this context, it lets users add or modify fea- 
tures in video game engines, Web servers, version control 
systems and other applications (specific examples include 
World of Warcraft, SimCity 4, Adobe Photoshop Light- 
room, and Squeezebox Jive Platform). Several properties 
make Lua well suited for implementing ASOs. First, it 
employs a small set of programming constructs (including first-class functions) and a small number of data types (including tables, which are heterogeneous associative arrays). Second, Lua compiles to simple bytecode, which
makes it relatively easy to sandbox. Finally, ASOs writ- 
ten in Lua are concise and small when serialized; the Lua 
ASOs we implemented are all under 1.5KB, about five to 
ten times smaller than Java equivalents. 


Comet represents ASOs as Lua tables that encapsulate 
both persistent state and the handlers to be invoked on 
that state. Lua tables can implement basic arrays, associative arrays, or both. While an associative array can contain any name-value mappings, we treat certain associations as handlers. In particular, if the ASO table contains an associative array with the names “onGet,” “onPut,” “onUpdate,” or “onTimer,” and those names are associated with values that are Lua functions, then the runtime invokes those functions when the corresponding
events occur. Our runtime system serializes Lua tables 
into a byte stream for transmission to a storage node on a 
put request. 
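
The naming convention above can be illustrated with a small dispatcher. This is a sketch of the idea only, not the Comet runtime: the function `dispatch` and its return convention are hypothetical.

```lua
-- Names that the runtime treats as handlers, per the convention above.
local HANDLER_NAMES = { onGet = true, onPut = true, onUpdate = true, onTimer = true }

-- Invoke the named handler only if the ASO table maps that name to a function.
-- Returns true plus the handler's result, or false if no handler applies.
local function dispatch(aso, event, ...)
  local handler = HANDLER_NAMES[event] and aso[event]
  if type(handler) ~= "function" then
    return false
  end
  return true, handler(aso, ...)
end

-- Example object: "onGet" is present but is not a function, so it is ignored.
local aso = { value = 42, onGet = "not a function" }
function aso:onTimer() self.value = self.value + 1 end
```

Entries whose names are not in the handler set, or whose values are not functions, are treated as ordinary state.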

We made several modifications to the standard Lua in- 
terpreter for the Comet runtime system. We sandbox 
ASOs by removing all but the core libraries from the 
runtime, leaving only a math package, string manipula- 
tion, and table manipulation. As a result, handlers are 
extremely restricted: they have no direct network access, 
no system execution capabilities, no thread creation capa- 
bilities, and no file system access. We also strictly bound 
the amount of resources that a handler can consume. For 
example, the runtime limits both the number of bytecode 
instructions that a handler can execute and the amount of 
memory it can consume. If a handler exceeds either of 
these limits, the runtime terminates its execution. 
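
One way to approximate the instruction bound in stock Lua is a count hook from the standard debug library. The sketch below is an assumption about how such a limit could be enforced, not a description of Comet's modified interpreter; `runLimited` is a hypothetical helper, and the memory bound is not covered here.

```lua
-- Run fn under a bytecode-instruction budget using the standard count hook.
local function runLimited(fn, maxInstructions, ...)
  debug.sethook(function()
    -- Raised inside the hook, this error unwinds to the pcall below.
    error("instruction limit exceeded", 0)
  end, "", maxInstructions)
  local ok, result = pcall(fn, ...)
  debug.sethook()                -- always clear the hook afterwards
  return ok, result
end

local function cheap() return 1 + 1 end
local function runaway() while true do end end
```

With a budget of a few thousand instructions, `cheap` completes normally while `runaway` is aborted with an error instead of hanging the host.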

The Comet runtime exposes a DHT wrapper object to 
handlers, which allows an ASO to communicate with its 
environment. The ASO can learn information about the 
hosting node, including the external IP address and the 
current system time. It can also perform a restricted set 
of DHT operations. For example, it can perform get and put operations on replicated copies of its value stored at other nodes. In the API presented in Section 4.2, these operations return values or neighboring node IDs. However, since these operations are slow in the DHT setting
and may block for seconds or even minutes, we chose to 
implement them using function callbacks. Each such op- 
eration takes an optional parameter, a function which ac- 
cepts the result as its parameter. For example, instead of 
returning a value, a get operation takes a function which 
is eventually passed the result of the operation. The op- 
eration returns immediately with no value, and the get 
is actually performed after the ASO execution has com- 
pleted. While this presents a slightly different paradigm 
to the user, we think this provides a greater ability to op- 
timize the performance of Comet-based applications. 
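
The callback style described above can be sketched with a mock runtime. Everything here other than the idea of passing a function to get is an assumption: the `pending` queue, the `store` table, and the `invoke` helper are hypothetical stand-ins for the real DHT wrapper.

```lua
-- Mock of the runtime's DHT wrapper; the real one is not shown in the text.
local pending = {}               -- operations queued while a handler runs
local store = { k1 = "v1" }      -- stand-in for values held at other nodes

local dht = {}
function dht.get(key, callback)
  -- Return immediately; remember what to do once the handler finishes.
  table.insert(pending, function() callback(store[key]) end)
end

-- Run one handler, then perform the operations it queued.
local function invoke(aso, handlerName, ...)
  local result = aso[handlerName](aso, ...)
  local queued = pending
  pending = {}
  for _, op in ipairs(queued) do op() end
  return result
end

local aso = {}
function aso:onTimer()
  dht.get("k1", function(value) self.seen = value end)
  self.duringHandler = self.seen -- still nil: the get has not run yet
end
```

The handler never blocks: the result arrives through the callback only after the handler has returned.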


4.4 Comet Prototype Implementation 


We built the Comet prototype on the Vuze DHT, which 
supports the widely used Vuze BitTorrent client. The 
DHT is used mainly for distributed tracking of torrents; 
however it has been used in research as well [27, 25]. 

Vuze implements the Kademlia routing protocol, in 
which each node is assigned a 160-bit ID based on the 
SHA-1 hash of its IP address and port. Basic DHT operations (get, put, and remove) take a 160-bit key, perform
a lookup to find nodes whose ID is close to that key, and 
then send a read or store RPC to those nodes. 

We minimally extended the Vuze interface to conform 
to Comet’s abstract operations. For example, we aug- 
mented get to allow a caller to pass an arbitrary byte- 
string argument. This supports a parameterized get op- 
eration, where the ASO can return different values de- 
pending on the parameter (analogous to the semantics for 
GET in HTTP). 

Allowing extensibility in a DHT environment creates 
challenges, e.g., it has the potential to provide a platform 
for DDoS attacks. Therefore, in addition to the Lua re- 
source restrictions described previously, we limit DHT 
communications that ASOs can perform in two ways. 

First, we do not allow an ASO to perform operations on 
arbitrary DHT keys or nodes, but rather only on specific 
key-node pairs. An ASO may communicate with any of 
its neighboring nodes that are responsible for replicas of 
the ASO. We also allow the ASO to communicate with 
key-node pairs that have interacted with it in the past, 
once for each such interaction. To enable this function- 
ality, we extended Comet requests to include the ID of 
the requesting node and the ID of a local key contained 
within the node. If an ASO receives a get request with 
a key ID specified, it gains the capability for a one-time 
operation on that key to the node that issued the request. 
The ASO can then either return a value immediately and 
exhaust its one-time capability, or save that capability 
for future use. This mechanism allows applications to 
respond to DHT requests at a future point in time, especially if the requested data is not currently available.
We do not allow ASOs to pass these capabilities between 
each other as doing so would enable a malicious node 
to mount DDoS attacks. In Section 5 we discuss signed 
ASOs, which do not have these restrictions. 
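
The one-time capability rule can be sketched as a small bookkeeping table. The function names `grant` and `consume` and the string key format are hypothetical; a real runtime would presumably keep such a table per ASO.

```lua
-- Capability table kept by a hypothetical runtime, keyed by "node/key".
local capabilities = {}

local function capId(nodeId, keyId) return nodeId .. "/" .. keyId end

-- Called when a get arrives that names a key on the requesting node.
local function grant(nodeId, keyId)
  local id = capId(nodeId, keyId)
  capabilities[id] = (capabilities[id] or 0) + 1
end

-- Called before an ASO-initiated operation on (nodeId, keyId).
-- Consumes one use and reports whether the operation is permitted.
local function consume(nodeId, keyId)
  local id = capId(nodeId, keyId)
  local remaining = capabilities[id]
  if not remaining then return false end
  capabilities[id] = remaining > 1 and remaining - 1 or nil
  return true
end
```

Each interaction grants exactly one later operation on that key-node pair, so an ASO cannot direct traffic at nodes that never contacted it.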

Second, Comet imposes rate limits on the number of 
messages generated by an ASO, either to neighboring 
nodes storing replicas or to arbitrary key-node pairs that 
have interacted with it in the past. This prevents misbe- 
having ASOs from exhausting the bandwidth resources of 
the Comet nodes hosting them. We discuss these security 
issues further in Section 7. 
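
The text does not say which rate-limiting algorithm Comet uses. As one plausible illustration, a token bucket bounds both the sustained message rate and the burst size; the names and parameters below are assumptions, and time is passed in explicitly so the sketch stays deterministic.

```lua
-- Token bucket: `rate` messages per second, bursts of up to `burst`.
local function newLimiter(rate, burst, now)
  return { rate = rate, burst = burst, tokens = burst, last = now }
end

-- Returns true if a message may be sent at time `now` (in seconds).
local function allow(limiter, now)
  local elapsed = now - limiter.last
  limiter.last = now
  limiter.tokens = math.min(limiter.burst, limiter.tokens + elapsed * limiter.rate)
  if limiter.tokens >= 1 then
    limiter.tokens = limiter.tokens - 1
    return true
  end
  return false
end
```

A runtime could keep one such limiter per ASO and drop or defer any outgoing message for which `allow` returns false.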


5 Applications 


This section seeks to demonstrate both the range of stor- 
age behaviors that Comet can support and the ease with 
which those behaviors can be implemented. To do this, 
we describe several of the active storage applications 
we have implemented, deployed, and measured on our 
Comet PlanetLab prototype. We provide code snippets to 
show how simply these actions can be programmed in our 
Lua-based ASO environment. In Section 6, we present 
measurements from some of these examples. 


5.1 Customizable Replication 


Most DHTs specify a fixed replication policy for stored 
values, requiring applications to conform to that pol- 
icy. In contrast, Comet ASOs can provide their own 
application-specific replication mechanisms, e.g., con- 
trolling the replication factor, the replication interval, and 
the choice of nodes on which the object will be repli- 
cated. This flexibility is useful for applications that place 
varying degrees of emphasis on performance, availabil- 
ity, locality, and security. For instance, a security sensi- 
tive application (such as Vanish) might use a small num- 
ber of replicas and long replication intervals, limiting the 
dispersion of its objects stored in the DHT. On the other 
hand, an application that values availability might repli- 
cate frequently to a large number of nodes. 

Listing 1 shows how an ASO can define a customized 
replication policy. In this example, the onTimer han- 
dler wakes up periodically, invokes lookup to deter- 
mine a list of nodes closest to the ASO’s key, executes 
selectGoodNodes¹ to identify a subset of nodes that will
serve as replicas, and then stores a copy of itself on the 
selected nodes using put. We have also implemented a 
timer handler that replicates only when the number of ex- 
isting replicas falls below a certain threshold; this lowers 
communication overhead and mitigates data harvesting 
attacks for security sensitive applications, reflecting the 
changes we made to Vuze after we published Vanish [25]. 


¹ The Lua code for selectGoodNodes is omitted for brevity. It implements an application-specific policy for choosing replicas.


9th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’10) 





function aso:handleLookup(nodes)
  nodes = self.selectGoodNodes(nodes)
  dht.put(dht.getKey(), self, nodes)
end

function aso:onTimer()
  dht.lookup(dht.getKey(), self.handleLookup)
end





Listing 1: Smart Replication 
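
The threshold-based variant mentioned above is not listed in the text. The following sketch shows one way it might look, under stated assumptions: the ASO tracks the nodes it believes hold a copy in a hypothetical `replicas` field, `MIN_REPLICAS` is an invented threshold, and `dht` is a mock whose function names follow Listing 1.

```lua
-- Mock DHT wrapper so the sketch runs on its own; names follow Listing 1.
local puts = {}
local dht = {
  getKey = function() return "key" end,
  lookup = function(key, callback) callback({ "n1", "n2", "n3", "n4" }) end,
  put    = function(key, value, nodes) puts[#puts + 1] = nodes end,
}

local MIN_REPLICAS = 3           -- assumed threshold

-- Nodes believed to hold a copy (an assumption of this sketch).
local aso = { replicas = { n1 = true, n2 = true } }

function aso:onTimer()
  local count = 0
  for _ in pairs(self.replicas) do count = count + 1 end
  if count >= MIN_REPLICAS then return end   -- enough copies: stay quiet
  dht.lookup(dht.getKey(), function(nodes)
    local targets = {}
    for _, node in ipairs(nodes) do
      if not self.replicas[node] and count < MIN_REPLICAS then
        targets[#targets + 1] = node
        self.replicas[node] = true
        count = count + 1
      end
    end
    if #targets > 0 then dht.put(dht.getKey(), self, targets) end
  end)
end
```

The object sends no messages at all while it has enough replicas, which is what lowers the communication overhead.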


5.2 Controlling Data Access


Comet objects can implement various policies that con- 
trol how data stored in the objects is accessed. We illus- 
trate a few such examples. 


Timeouts and Limited-read values: ASOs can be 
used to implement objects that will be accessible for only 
a limited, application-specified time. Such objects are 
meaningful for security applications such as Vanish [25], 
which provide support for self-destructing digital data by 
storing cryptographic keys in a DHT. 

Listing 2 shows the handler code required to imple- 
ment application-specific timeouts. Each replica stores a 
timestamp when the object is created (stored) and then 
deletes the object after 60 minutes using a timer handler. 
In addition, the onGet handler prevents the object’s con- 
tents from being accessed after the timeout but before it 
is deleted by a timer handler. 


function aso:onPut(value)
  self.timeout = dht.getSystemTime() + 60*MINUTES
  return self
end

function aso:onTimer()
  if (dht.getSystemTime() > self.timeout) then
    -- delete local ASO
    dht.deleteSelf()
  end
end

function aso:onGet()
  if (dht.getSystemTime() > self.timeout) then
    -- delete local ASO
    dht.deleteSelf()
    return nil
  end
  return self
end


Listing 2: Timeouts 


An ASO can also choose to delete itself after it has 
been read — providing a “limited-read value” — where 
each replica can be read at most once. In addition to 
its use for self-destructing data, limited-read values could 
be used in settings where objects represent tasks and are 
deleted once they have been claimed by worker nodes. 
The object then serves as a synchronizing construct be- 
tween the task’s producer and consumer. 
Listing 3 implements limited-read values. When a get
is performed, the node records the fact that the value has 
been read. It then propagates the request to every other 
replica by overwriting them with nil. Note that the ob- 
ject does not delete itself immediately, but rather stays 
around for a while and periodically attempts to delete 
other replicas to ensure that copies on nodes with tran- 
sient connectivity issues [22] are eventually deleted. Note 
also that concurrent gets issued to different replica nodes 
might successfully read the value. In general, as with 
other distributed storage systems, consistent update of 
replicated values would require the use of heavy-weight 
consensus operations. Comet does not currently provide 
such primitives. ASO handlers do however provide the 
ability for replicas to detect and correct inconsistencies, 
e.g., ASOs can compare and reconcile replica contents 
through periodic invocations of the onTimer handler. 


function aso:onGet()
  if (self.read) then return nil end
  self.read = dht.getSystemTime() + 30*MINUTES
  dht.put(dht.getKey(), nil) -- deletes replicas
  return self
end

function aso:onTimer()
  if (self.read) then
    dht.put(dht.getKey(), nil) -- deletes replicas
    if (dht.getSystemTime() > self.read) then
      dht.deleteSelf()
    end
  end
end


Listing 3: Limited-Read Values 


Data Subscription: An ASO can allow clients to “sub- 
scribe” so that they will be notified when the ASO re- 
ceives a new value. In Listing 4, when the subscriber 
performs a get, the ASO saves the subscriber’s network 
location (callerNode) and a key that will serve as the 
subscriber’s recipient of the value (callbackKey). When 
a value update occurs, the ASO distributes the value to 
all registered subscribers — the runtime ensures that the 
ASO distributes these values only to clients who have 
actually performed a get on the ASO. In the example 
shown, the ASO clears its subscriber list after its put op- 
erations; subscribers must then re-subscribe if they’re still 
interested. Later we will describe an implementation of a 
scalable publish-subscribe scheme based on this design. 


aso.pending = {}

function aso:onGet(callerNode, callbackKey)
  if (self.value) then
    return self.value
  end
  self.pending[callerNode] = callbackKey
  return nil
end

function aso:onUpdate(callerNode, value)
  self.value = value
  for callerNode, key in pairs(self.pending) do
    dht.put(key, value, {callerNode})
  end
  self.pending = {}
end


Listing 4: Pub-sub


Sensitive values: ASOs can implement various forms of access control policies. For instance, Listing 5 provides read access to the object’s value only if the client can present a predetermined password, akin to a feature already provided by some DHTs, like OpenDHT [40]. A client provides the password as an argument to the get request.

There are a few issues with the code in Listing 5,
especially if it were to be extended to support password- 
protected updates. A malicious node could claim to store 
the object but simply serve as a proxy for clients’ requests 
and thereby implement man-in-the-middle attacks. This 
could be solved by exposing basic encryption primitives 
to the ASO, like a secure hash function and/or public key 
cryptographic primitives. For example, instead of passing 
the plaintext password to the ASO, the client hashes the 
concatenation of the password with its IP/port, thus the 
ASO can verify that the request is not being forwarded 
by a malicious node. The ASO’s security can be further 
strengthened by public/private key pairs, with the ASO 
storing the public key and clients authenticating them- 
selves by presenting a message signed with the corre- 
sponding private key. With these enhancements, a ma- 
licious node storing a copy of the object cannot overwrite 
the contents of other replicas since it doesn’t possess the 
private key. 
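
The address-bound password check can be sketched as follows. The sketch rests on loud assumptions: `toyHash` is a simple non-cryptographic digest (djb2) used only so the example runs, and it stands in for a secure hash primitive that the runtime would have to expose; the "ip:port" string format and the field names are likewise hypothetical.

```lua
-- Placeholder digest (djb2). It is NOT cryptographically secure; it stands in
-- for a secure hash primitive that the runtime would have to expose.
local function toyHash(s)
  local h = 5381
  for i = 1, #s do
    h = (h * 33 + s:byte(i)) % 4294967296
  end
  return h
end

local aso = { secret = "Well kept secret", password = "mypass1234" }

-- `callerAddr` is the requester's "ip:port" as observed by the hosting node.
function aso:onGet(callerAddr, callbackID, token)
  if token == toyHash(self.password .. callerAddr) then
    return self.secret
  end
  return nil
end
```

A token computed by one client is useless when replayed from another address, because the hosting node recomputes the digest over the address it actually observes.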


function aso:onGet(caller, callerId, password)
  if (password == "mypass1234") then
    return "Well kept secret"
  end
  return nil
end





Listing 5: Password 


An application could use multiple mechanisms for 
controlling data access, e.g., it could use timeouts in con- 
junction with password-protected access. While Comet 
does not allow ASOs to register multiple handlers for a 
given storage operation, the developer can combine all of 
the desired mechanisms into a single handler. Though 
this might increase programming complexity, it allows 
the application developer to control how different mech- 
anisms interact with each other and provides the basis for 
a predictable and deterministic execution model. 
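
Combining a timeout with password-protected access in one handler might look like the sketch below. The mock `dht` clock, the assumption that system time is measured in seconds, and the field names are all illustrative choices, not details taken from Comet.

```lua
-- Mock of the one runtime call used here; the clock is settable for the demo.
local now = 0
local dht = { getSystemTime = function() return now end }

local MINUTES = 60               -- assumed: system time is in seconds

local aso = { secret = "Well kept secret", password = "mypass1234" }

function aso:onPut(caller)
  self.timeout = dht.getSystemTime() + 60 * MINUTES
  return self
end

-- A single onGet applies both checks, in an order the developer chooses.
function aso:onGet(caller, callbackID, password)
  if dht.getSystemTime() > self.timeout then
    return nil                   -- expired: refuse even with the password
  end
  if password ~= self.password then
    return nil                   -- wrong or missing password
  end
  return self.secret
end
```

Because both checks live in one function, their order and interaction are explicit rather than dependent on how a runtime would sequence multiple handlers.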


5.3 Measurements and Monitoring


DHT Measurements: ASOs provide a platform for 
instrumenting and measuring the DHT using the DHT 
nodes themselves. This enables a more detailed and com- 
prehensive view of the DHT and helps provide accurate 
estimates of DHT properties such as churn, node lifetime 
distribution, transient inconsistencies, etc. 

For instance, Listing 6 tracks the k closest nodes to the 
ASO and stores the information it learns as part of the 
object state. A measurement application can create ob- 
jects of this type, store them at multiple locations within 
the DHT, and obtain snapshots of DHT membership by 
retrieving the objects’ contents using get operations. 


aso.neighbors = {}

function aso:handleLookup(nodes)
  self.neighbors[dht.getSystemTime()] = nodes
end

function aso:onTimer()
  dht.lookup(dht.getKey(), self.handleLookup)
end


Listing 6: Lifetime 


While this measurement could be performed by nodes 
that are not part of the DHT (as in earlier work [20, 50]), 
measurements from within the DHT can provide more ac- 
curate data. For example, the lifetime measurement could 
be carried out by a client that interactively crawls the 
routing tables of the DHT nodes and then uses heartbeat 
messages to monitor the uptimes of the nodes it learns 
about. This approach could provide faulty data, however, 
if the DHT contains firewalled nodes that do not receive 
or respond to such heartbeat messages.² On the other
hand, firewalled nodes still communicate with neighbors, 
for example to replicate values. Therefore, measurements 
performed from ASOs within the DHT can be more ac- 
curate, as we will demonstrate later. 


Monitoring uses: An ASO can also maintain audit 
trails, e.g., indicating where it has been stored thus far, 
who has read or updated the object, etc. Such tasks are 
particularly useful for debugging and aid in rapid proto- 
typing. For example, this may help a developer to learn 
whether a new ASO replication mechanism is operating 
properly. Alternately, logs can also be used for forensics. 
Listing 7 illustrates a monitoring application that tracks 
the nodes storing and accessing a value. 

This specific implementation comes with a few 
caveats. Each replica may have a different view of the 
list of nodes that have stored or read the value. To address 
this, the experimenter needs to get the union of the lists 
stored in all the replicas, consolidating them as a post- 
processing step. 
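
The post-processing step amounts to a set union over the lists retrieved from each replica. A minimal sketch, assuming each list is a plain array of IP strings (`unionLists` is a hypothetical helper run by the experimenter, not by an ASO):

```lua
-- Merge IP lists fetched from several replicas into one duplicate-free list.
local function unionLists(lists)
  local seen, merged = {}, {}
  for _, list in ipairs(lists) do
    for _, ip in ipairs(list) do
      if not seen[ip] then
        seen[ip] = true
        merged[#merged + 1] = ip
      end
    end
  end
  table.sort(merged)             -- fixed output order for easy comparison
  return merged
end
```

The experimenter would call this once per tracked list (accessors, hosts, replicas) after issuing a get to every replica.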


² In fact, about half the nodes in P2P DHTs are firewalled [23].
aso.replicaIps, aso.hostIps, aso.accessorIps = {}, {}, {}

function aso:onGet(callerIp)
  table.insert(self.accessorIps, callerIp)
  return self
end

function aso:onPut(caller)
  table.insert(self.accessorIps, caller.getIP())
  table.insert(self.hostIps, dht.localNode.getIP())
  return self
end

function aso:handlePut(nodes)
  for i, v in ipairs(nodes) do
    table.insert(self.replicaIps, v.getIp())
  end
end

function aso:onTimer()
  dht.put(dht.getKey(), self, 20, self.handlePut)
end


Listing 7: Monitoring


5.4 Smart Rendezvous


DHTs are used for rendezvous in many distributed sys- 
tems. In P2P file-sharing systems such as BitTorrent, the 
DHT is used as a distributed tracker either with or as a 
replacement for a centralized tracker. That is, peers that 
want to download a particular file use the DHT to iden- 
tify other peers who are downloading or sharing the file. 
The downside with current DHT-based distributed track- 
ers, however, is that they result in random overlay con- 
nections, as there is no mechanism to enforce more intel- 
ligent peer-matching techniques. 

With Comet we can address this limitation by using 
ASOs to track participating nodes, as well as construct 
peer lists that are optimized for a requesting node. Peers 
could be matched in order to lower inter-node laten- 
cies [33], maximize reciprocation probability based on 
peer bandwidths [37], or lower ISP costs [62, 12]. We 
have implemented one such matching scheme that uses 
the nodes’ network coordinates to predict inter-node la- 
tencies and provides a list of nearby peers to each joining 
node. We describe this in depth in Section 6.3.2. 
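
To illustrate the general idea of coordinate-based matching (this is not the scheme evaluated in Section 6.3.2, whose code is not shown here), the sketch below ranks peers by Euclidean distance in an assumed two-dimensional coordinate space. The peer record layout and the function names are hypothetical.

```lua
-- Euclidean distance between two coordinate vectors of equal length.
local function distance(a, b)
  local sum = 0
  for i = 1, #a do
    sum = sum + (a[i] - b[i]) ^ 2
  end
  return math.sqrt(sum)
end

-- Return up to `k` peers closest to `requester` in coordinate space.
local function nearestPeers(peers, requester, k)
  local sorted = {}
  for _, peer in ipairs(peers) do sorted[#sorted + 1] = peer end
  table.sort(sorted, function(p, q)
    return distance(p.coord, requester) < distance(q.coord, requester)
  end)
  local result = {}
  for i = 1, math.min(k, #sorted) do result[i] = sorted[i] end
  return result
end
```

An ASO acting as a tracker could store the peer list in its state and call such a function from onGet, using coordinates supplied by the requesting node.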


5.5 Signed ASOs 


The examples discussed so far adhere to the strict security 
policy we set out: ASOs cannot perform operations on ar- 
bitrary DHT keys or nodes. We now consider uses where 
we relax this assumption, but require that the ASO code 
be signed by the DHT administrator after manual verifi- 
cation of its security properties. As we will see below, 
this allows the DHT to deploy new functionality and ser- 
vices by using signed ASOs that access arbitrary DHT lo- 
cations, but are safe (i.e., they do not enable DoS attacks 
of targeted DHT nodes).³ We have considered signed ASOs in particular as a mechanism that a DHT’s developer or administrator could use for testing and evaluation of new features, before they are added to the main-line DHT code.


³ In some cases, the safety of the ASO code could presumably be verified automatically, e.g., by using sophisticated compile-time analysis; studying this is part of future work.


Recursive Get: Vuze and many other DHTs support iterative routing for key lookups. In this approach, the node performing the lookup is involved in every step of the routing operation, i.e., it identifies the target node by repeatedly querying DHT nodes to find other nodes that are closer to the target key. An alternative is to perform recursive routing, where intermediate nodes on the route pass the lookup directly to nodes that are closer to the key. Iterative lookup provides greater control to the node performing the lookup (e.g., it can control lookup parallelism), but it comes at the cost of increased latency. If both forms of lookup are available, an application would use recursive lookups by default, but fall back on iterative lookups after persistent failures [15].


With signed ASOs it is possible to implement recursive lookups even though the underlying DHT supports iterative lookup by default (as is the case with Chord, Kademlia, and Vuze). The node initiating the lookup creates a query ASO, which contains a reference to itself, and a local callback ID where it would like to receive the answer. When the signed ASO is created its onPut handler is invoked; the handler queries the local routing table to find a live node that is closest to the target key, stores a copy of the signed ASO on this node, and deletes itself from the current node. This process is repeated until one of the nodes storing the target is reached, and the onUpdate handler of the target ASO sends the object’s value back to the original node, which initiated the request.


Caching and Hierarchical Publish-Subscribe: This idea can be extended to accomplish both caching and hierarchical publish-subscribe data delivery. For caching, the onUpdate handler can be modified to communicate the object not only to the requesting node but also to the intermediate node that conveyed the request. The number of intermediate nodes to which the object is replicated can be determined by gathering and analyzing statistics on object popularity (also accomplished using simple handler code), so that only popular objects are replicated at multiple nodes (as in Beehive [39]). To implement hierarchical publish-subscribe, intermediate nodes propagate a subscription event to the next node in the lookup process only if they haven’t done so before and maintain state for subsequent queries routed to them. When a value is published, it is propagated through a dissemination tree so that the communication load is distributed across all intermediate nodes (as in Scribe and Bayeux [45, 64]).


5.6 Summary 


This section described a set of example storage objects 
that we have implemented using Comet. Through these 
examples, it should be clear that with very small exten- 
sions (on the order of a few lines or a few tens of lines 
of code), a Comet application can create a wide range of 
powerful storage object behaviors that would be impossi- 
ble in existing distributed storage systems or DHTs. 


6 Evaluation 


We deployed Comet on approximately 200 PlanetLab hosts and evaluated our design in three steps. First, we characterize the resource utilization of the various applications that we developed. Second, we measure microbenchmarks to understand the overheads associated with active storage objects. Lastly, we report on our experiences with prototyping applications using Comet.


6.1 Application Characteristics 


Table 2 shows resource consumption requirements for 
our Comet applications. The Max Instructions column 
gives the number of dynamic Lua instructions required 
to execute the most expensive handler, while Execution 
Time gives the execution time for that handler. Where 
this value is data sensitive, we provide an estimate based 
on the expected maximum value. Code Size shows the 
size of each ASO with the minimum amount of data and 
Max Size is the maximum size to which the ASOs might 
grow for that application. From the table we see that most 
ASOs execute fewer than 100 Lua instructions and are 
smaller than 1KB in size. 





















































Application            Max Instructions   Execution Time   Code Size   Max Size
Replication            < 10               4us              0.223K      < 1K
Smart Replication      < 100              6us              0.444K      < 1K
Timeouts               ~ 10               4us              0.434K      < 1K
Limited-Read Value     ~ 10               4us              0.553K      < 1K
Sensitive Value        < 10               4us              0.230K      < 1K
Pub Sub                10,000s            54us             0.498K      100K
Hierarchical Pub Sub   100s               6us              0.673K      1K
Lifetime (External)    100s               6us              1K          6K/hr
Lifetime (Internal)    < 100              6us              1.716K      ~ 3K
Monitoring             ~ 10               4us              0.971K      3K/hr
Smart Rendezvous       1,000s             14us             1.107K      10K
Recursive Get          ~ 50               6us              0.714K      ~ 1K





Table 2: Expected Application Resource Consumption 


6.2 Performance and Overheads


We report on simple microbenchmark measurements to 
compare the CPU and memory costs of Vuze and Comet. 
These experiments were run on a quad-core machine
with Xeon processors clocked at 2.67GHz. 


Single-Node Throughput. In this experiment, concurrent
get operations are performed on many values stored
in the target node. We measure the throughput of get
requests that return successfully using a closed feedback
loop. All operations are issued locally on the node, so
that network latency does not affect throughput.

Figure 2: Microbenchmarks.

Figure 2(a) compares the throughput of objects with 
different ASO execution costs, expressed as the number 
of Lua bytecode instructions executed per handler. Both 
Comet and Vuze experience peak throughput when the 
number of concurrent operations is equal to the num- 
ber of cores (eight). ASOs with zero instructions per 
handler are functionally equivalent to Vuze values as 
they simply return themselves. The peak throughput of 
Comet ASOs is about 60% smaller than the peak through- 
put of Vuze (1.4M operations per second as opposed to 
3.5M operations per second). This shows the cost of 
the Comet/Lua execution environment. Previous mea- 
surements [49] show that the typical DHT load on Vuze 
clients in the wild is at most a few hundred operations 
per second, which makes the additional Comet overhead 
relatively insignificant in this context. As we increase 
the computational complexity of the average ASO (1K to 
1M instructions per handler), the throughput decreases, 
but still remains well above the maximum current Vuze 
workload. 


Operation Latency. At the 90th percentile, with maxi- 
mum throughput (8 concurrent operations in our exper- 
iments), a request involving 100 Lua instructions has a 
latency of about 300 microseconds. For handlers with 
1M instructions (two orders of magnitude more than our 
most compute-intensive handlers), it is 13 milliseconds. 
The latency for a Vuze DHT lookup is on the order of
seconds, so the latency imposed by even extremely
computationally intensive ASOs is not significant.


Memory Footprint. In this experiment, we store in- 
creasing numbers of values in the nodes. For the Vuze 
nodes, the string “hello world” is stored at different keys, 
while for Comet nodes we store an equivalent Lua ASO 
which returns the same string upon a get request. Fig- 
ure 2(b) compares the memory footprint of the Vuze and 
Comet nodes as we increase the number of stored objects.
Again using the median number of values stored
per Vuze node (around 400), the difference in memory 
consumption at this level is negligible (about 36MB for 
both Comet and Vuze). Long lived DHT nodes can store 
10,000s of values, and the highest observed is around 
30,000 values [49]. In these rare cases, our overhead rela- 
tive to Vuze is about 27%, but even then the total memory 
footprint is still reasonable. 

We next consider a workload where Comet object
sizes are exponentially distributed with an average size
of 10KB. In this case, a node with 500MB can store on
average 50,000 values. If we assume an order of magnitude
more values per node than in Vuze (4,000 instead
of 400), and an order of magnitude larger values (10KB
instead of the 1KB limit imposed by Vuze), the median node
would consume about 80MB (40MB of startup memory
costs and another 40MB for the ASOs) in memory.⁴
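
The arithmetic behind these estimates can be checked with a short calculation. The sketch below is only a back-of-the-envelope aid: the function name is hypothetical, decimal units (1KB = 1,000 bytes) are assumed, and the 40MB startup cost is the figure quoted in the text.

```python
KB = 1_000          # decimal units are an assumption made for round numbers
MB = 1_000 * KB

def estimate_node_memory(num_values, avg_value_bytes, startup_bytes=40 * MB):
    """Return (aso_bytes, total_bytes) for a node storing num_values objects."""
    aso_bytes = num_values * avg_value_bytes
    return aso_bytes, startup_bytes + aso_bytes

# Median node under the assumed workload: 4,000 values of 10KB each.
aso, total = estimate_node_memory(num_values=4_000, avg_value_bytes=10 * KB)
print(f"ASOs: {aso / MB:.0f}MB, total: {total / MB:.0f}MB")

# How many 10KB values fit in 500MB on average.
capacity = (500 * MB) // (10 * KB)
print(f"values in 500MB: {capacity}")
```

The results match the figures in the text: 40MB for the ASOs, about 80MB in total, and 50,000 values in 500MB.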


6.3 Application Experience 


We now report on our experiences in prototyping and de- 
ploying some of the applications described in Section 5. 


6.3.1 Measuring Node Lifetimes 


We revisit the experiment performed by Falkner et al. [20] 
to measure the lifetimes of nodes in the Vuze DHT. This 
experiment was done by performing random get opera- 
tions from several Vuze clients in order to gather approxi- 
mately 300K IPs participating in the DHT. The collection 
of nodes was then pinged every 2.5 minutes to check for 
liveness. The authors observed that nearly half the nodes 
were immediately unavailable after first being detected. 
One weakness of the methodology employed is that the 
clients could not differentiate nodes that are unreachable 
because of NATs from those that have left the DHT. Us- 
ing measurement nodes that have active communication 
channels with NATed DHT nodes would help minimize
measurement bias, but would require the measurement to
be performed by nodes that are within the DHT.

⁴ Vuze and Comet consume about 40MB without storing any values.

Figure 3: Node Lifetimes in Vuze.

Comet enables researchers to deploy experiments us- 
ing measurement ASOs executed on nodes that are part 
of the DHT. To demonstrate the feasibility of this ap- 
proach, we deployed Comet on 190 geographically dis- 
persed PlanetLab nodes and integrated them into the pro- 
duction Vuze DHT. The measurement ASOs are stored 
on the Comet nodes, and they gather information about 
unmodified Vuze nodes that are adjacent (in the DHT) 
to the Comet nodes. We stored a lifetime measurement 
ASO (a variant of the code shown in Listing 6) at each of 
the Comet nodes, allowed the nodes to perform measure- 
ments for several days, and then collected and analyzed 
the data from these nodes.⁵ Figure 3 plots the measurement
data obtained from our experiments and compares it
to the lifetime data obtained by measurement nodes that
are not integrated into the DHT (as in [20]). We observe
that the measurements performed from within the DHT 
provide higher estimates for node lifetimes. The reason 
is that DHT-internal measurement nodes are able to tra- 
verse NATs in communicating with their neighbors. The 
difference is significant; we measured the median node 
lifetime as 3.1 hours, as opposed to an estimate of 0.5 
hours obtained through conventional external measure- 
ments. Measurement ASOs are thus valuable tools in 
characterizing DHTs and provide more accurate data for 
tuning parameters such as replication factor, routing par- 
allelism, etc. 


6.3.2 Smart Rendezvous 


In Section 5, we proposed a way to employ intelligent 
peer tracking for distributed P2P trackers using ASOs. 
We evaluate the usefulness of this application by deploy- 
ing a distributed tracker built with Comet ASOs. As with 
traditional distributed trackers, clients participating in a 
P2P swarm (such as a BitTorrent download) register their
participation by storing their IP addresses under the appropriate
DHT key. In addition, clients also store their
network coordinates (computed using Vivaldi [13]) along
with their IP information. When clients contact the distributed
tracker to obtain peer lists, the tracker ASO estimates
the network latency between pairs of nodes using
the supplied network coordinates and returns peers that
are likely to be close to the requesting node. To evaluate
this approach in practice, we deployed a tracker ASO on
a Comet node in PlanetLab, while 190 PlanetLab nodes
acted as peers in the swarm, reporting their Vivaldi coordinates
to the tracker and requesting good peers with which
to communicate. Figure 4 depicts the effectiveness of this
strategy compared to the default strategy of returning a
random subset of peers to the requesting node. The graph
shows a CDF of the measured latencies between peers
under the two different matching schemes. The median
value for the ASO-implemented Vivaldi intelligent peer
matching is 47ms compared to a median of 72ms for the
default scheme, a 35% latency improvement.

⁵ As Comet is not currently deployed by Vuze, the measurement
ASOs are stored only on the nodes that we control. A more extensive
deployment would allow us to obtain more samples quickly.

Figure 4: Proximity of BitTorrent peers.
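
The peer-matching step can be illustrated with a small sketch. It is not the tracker ASO itself: it is written in Python rather than Lua, the function names and sample coordinates are hypothetical, and it treats the plain Euclidean distance between two coordinate tuples as the latency estimate, ignoring any extra components a Vivaldi implementation may carry.

```python
import math

def predicted_latency(a, b):
    """Euclidean distance between two coordinate tuples, read as a latency estimate."""
    return math.dist(a, b)

def closest_peers(requester_addr, peers, k):
    """Return the k peer addresses predicted to be nearest the requester.

    peers maps address -> coordinate tuple and must contain requester_addr.
    """
    origin = peers[requester_addr]
    ranked = sorted(
        (predicted_latency(origin, coord), addr)
        for addr, coord in peers.items()
        if addr != requester_addr
    )
    return [addr for _, addr in ranked[:k]]

# Hypothetical swarm: addresses and 2-D coordinates are made up for illustration.
swarm = {
    "10.0.0.1": (0.0, 0.0),
    "10.0.0.2": (3.0, 4.0),
    "10.0.0.3": (30.0, 40.0),
    "10.0.0.4": (1.0, 1.0),
}
print(closest_peers("10.0.0.1", swarm, k=2))
```

The default strategy mentioned in the text would instead return a random subset of the swarm, which is what the measured comparison in Figure 4 is made against.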


6.3.3 Vanish 


Comet grew in part out of our experience specializing the
Vuze DHT for Vanish [25], a self-destructing data system.
Vanish used Vuze for key storage; however, Wolchok
et al. [61] showed that the Vuze system was extremely
open to a Sybil data harvesting attack that is able
to scan the DHT for values. The attack worked in part
because of Vuze’s overly zealous replication policy: a
high replication factor, coupled with a policy to replicate
to new nodes immediately. In response, we set out
to deploy new replication mechanisms and other anti-Sybil
defenses in Vuze [24]. While these mechanisms
were straightforward, deploying them required the cooperation
of Vuze’s designer and was an arduous and imperfect
process. While many iterations would have been
necessary to fully test and optimize policies, we often
had only one shot to catch the two-month release cycle.⁶
For the same reason, we were unable to test individual
changes in isolation, as they had to be shipped in bundles
in order to make progress in reasonable time.

⁶ It takes a week or more from release until 80% of the nodes in
Vuze adopt changes. This is in addition to a typical release cycle Vuze

We have used Comet to re-implement several of the 
changes that we deployed in Vuze. Those changes in- 
clude the customizable replication scheme described in 
Section 5 (particularly a scheme that replicates only when 
the number of replicas falls below a threshold) and vari- 
able object lifetimes. As we showed in Section 5, both 
of these changes are trivial to program as Comet ASOs. 
Perhaps even more important, testing and re-deployment 
in Comet is significantly easier, as it does not require a 
redistribution of the entire DHT code base. Instead, new 
mechanisms can be deployed by overwriting the handler 
code for existing objects and using the updated bytecode 
for subsequently created objects, without requiring the involvement
of the DHT administrators.⁷ Had Comet existed
at the time we deployed Vanish, it would have been
possible to customize the DHT for the security require- 
ments of the application from the start, and to optimize 
those policies to Vanish’s requirements. 
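
As a rough illustration of the threshold-based replication scheme mentioned above, the decision an ASO makes on each maintenance round might look like the following. This is a Python sketch under assumed names and parameters (replicas_to_create, threshold, target); the actual Comet handlers are written in Lua, and the values used by the authors are not given in this section.

```python
def replicas_to_create(live_replicas, threshold, target):
    """Number of new replicas to create in this maintenance round.

    Nothing is replicated while at least `threshold` replicas are alive,
    so a newly joined node does not automatically receive a copy.
    Once the count drops below the threshold, top up to `target`.
    """
    if live_replicas >= threshold:
        return 0
    return max(0, target - live_replicas)

# Hypothetical parameters: repair below 15 live replicas, restore to 20.
print(replicas_to_create(live_replicas=18, threshold=15, target=20))
print(replicas_to_create(live_replicas=12, threshold=15, target=20))
```

Compared with a policy that pushes a copy to every new neighbor immediately, this keeps values away from freshly joined nodes unless repair is actually needed, which is the property relevant to the harvesting attack described earlier.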


7 Security Analysis 


The classic security goals for DHTs include resilience to 
attacks that: violate the system’s ability to robustly store 
data [48], disrupt routing [48, 11], identify the partici- 
pating nodes in the DHT [53, 51], and harvest copies of 
data stored within the DHT [61]. There are numerous 
well-known techniques aimed at violating these goals, in- 
cluding Sybil attacks [19], Eclipse attacks [47], and many 
others [55]. And there are also many known mechanisms 
for protecting against such attacks, including the use of 
strong identities minted by a logically centralized authority,
computational puzzles and bandwidth contribution
proofs [9, 16, 18, 63, 10], and architectures built upon social
network structures [31, 63]. A production DHT with
ASO support must consider such classic security goals, 
and can leverage known countermeasures for the corre- 
sponding threats. (Although, as exemplified by Vuze and 
other popular DHTs, a DHT for ASOs may decide that 
the risks associated with these threats are minimal, and 
hence not deploy the known defenses.) 

employs, which spans about a month.

⁷ In general, updating the handler code for existing objects would
require the application to keep track of its ASOs. In the case of applications
such as Vanish, where objects are transient and have timeouts
in the order of a few hours, we can also let existing objects just expire
without explicitly updating them.

9th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’10)

The security concerns of DHTs with signed ASOs
are roughly those of conventional DHTs without ASOs
(since the signed ASOs can be viewed as “vetted” parts
of the DHT system itself); we therefore do not consider
signed ASOs further. Empowering DHTs with unsigned
ASOs does, however, create a new potential attack vector
not present in conventional DHTs, namely, attacks
via malicious ASOs. We seek to ensure that a malicious
ASO cannot: infer private information about or damage 
its Comet hosting node; infer information about or af- 
fect the properties of other ASOs stored within Comet; 
or infer private information about or affect the proper- 
ties of other Comet nodes and arbitrary computers on 
the Internet. To place these goals in context, we stress 
that while an attacker could always use her own custom 
software to communicate with Comet in arbitrary ways, 
including putting to or getting from arbitrary ASO keys 
and communicating with the broader Internet in arbitrary 
ways, our goals — if attained — imply that ASOs cannot 
be used to amplify the attacker’s resources or capabili- 
ties. For example, an attacker should not be able to create 
an ASO “worm” that spreads virally, mounting a DDoS 
attack against a victim ASO or device on the Internet. 
We find that it is possible to meet these goals using 
three architectural features: (1) restricting system access, 
(2) restricting resource consumption, and (3) restricting 
within-Comet communication. We consider each in turn. 


Restricting system access. We designed the ASO API 
to be highly restrictive. The API explicitly restricts an 
ASO’s ability to infer private information about its host 
or to affect the host’s state. The API similarly restricts an 
ASO’s ability to interact with arbitrary devices on the In- 
ternet. For example, the API limits an ASO’s IO capabil- 
ities to explicitly defined DHT operations; arbitrary disk, 
network, and other IO operations are prohibited. The API 
also prevents an ASO from introspecting its host; e.g., al- 
though we allow the ASO to learn its host’s external IP, 
we explicitly prevent the ASO from learning its host’s in- 
ternal IP. Without these restrictions, an ASO could poten- 
tially read private files on the host’s disk, write sensitive 
files, attempt to DoS an arbitrary remote node, map the 
network topology of internal IP networks, and so on. The 
Lua sandbox provides a simple mechanism for achieving 
this isolation. Namely, we removed the IO system call 
interface and exposed one containing only the restricted 
DHT operations instead. 

Despite these restrictions, it may be possible for an 
ASO to infer (minimal) information about the hosting 
node via side-channels. For example, the time it takes 
an ASO to perform a computation could leak informa- 
tion to the ASO about the speed of the hosting processor. 
At the extreme, it may be feasible to infer modest infor- 
mation about other applications running on the hosting 
node [42]. We believe that such attacks are low risk in 
the Comet environment and do not consider them here. 


Restricting resource consumption. Comet also signif- 
icantly limits an ASO’s ability to consume resources on 
its hosting node. Our prototype limits both the memory 
and CPU consumption of ASOs. 


Memory. The Comet active runtime keeps a running
sum of the memory footprint of an ASO. Hard limits can
be set on the total memory consumption of an object;
ASOs that exceed this limit are evicted. Our current
prototype limits ASOs to 100kB.

CPU. The Comet runtime similarly keeps a running 
count of bytecode operations performed. We envision 
multiple policies for constraining CPU use. The naive 
policy limits each ASO to at most a limited number of 
instructions per handler invocation. Since not all Lua op- 
erations are equally costly, a more sophisticated policy 
would assign different weights to different Lua opera- 
tions (e.g., more cost for a table lookup than an addition). 
The limit could also be enforced over a fixed duration of 
time (such as 30 minutes) rather than upon each handler 
invocation (which might occur much more frequently). 
Our current prototype implements the naive restriction 
and allows 100K instructions per handler invocation. 
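
The two CPU policies can be contrasted with a small sketch. It is a Python model of the accounting only, not the Comet runtime: the class and function names are hypothetical, the opcode labels and the weight of 4 are illustrative, and only the 100K per-invocation limit comes from the text. With no weights, every operation costs one unit (the naive policy); supplying weights gives the more sophisticated variant.

```python
class ResourceLimitExceeded(Exception):
    """Raised when a handler invocation uses up its instruction budget."""


class InstructionBudget:
    def __init__(self, limit=100_000, weights=None):
        self.limit = limit                    # budget per handler invocation
        self.weights = dict(weights or {})    # opcode -> cost; default cost is 1
        self.used = 0

    def start_invocation(self):
        self.used = 0

    def charge(self, opcode):
        self.used += self.weights.get(opcode, 1)
        if self.used > self.limit:
            raise ResourceLimitExceeded(f"{self.used} > {self.limit}")


def run_handler(budget, opcodes):
    """Charge each executed opcode; return False if the handler was cut off."""
    budget.start_invocation()
    try:
        for op in opcodes:
            budget.charge(op)
    except ResourceLimitExceeded:
        return False
    return True


naive = InstructionBudget(limit=100_000)
print(run_handler(naive, ["ADD"] * 100_000))     # exactly at the limit
print(run_handler(naive, ["ADD"] * 100_001))     # one instruction over

weighted = InstructionBudget(limit=100_000, weights={"GETTABLE": 4})
print(run_handler(weighted, ["GETTABLE"] * 25_001))
```

Enforcing the limit over a fixed time window instead would amount to calling start_invocation on a timer rather than at the start of every handler call.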

Comet provides support for exception handling in or- 
der to help debug faulty ASOs that exceed the system- 
imposed resource limits. Handlers can catch resource ex- 
haustion exceptions and store the relevant handler state 
as part of the ASO. The developer can then retrieve this 
stored state and inspect it to determine why the handler 
exceeded the resource limits. Further, operations that re- 
turn values, e.g., gets, provide the stack trace as a return 
value in the case of an exception. We found these fea- 
tures to be useful in debugging many of the applications 
that we prototyped using Comet. 


Restricting within-Comet communications. There 
are two classes of communications that we must consider: 
communications between one ASO and another, and call- 
back communications to a caller. 

Communications between ASOs. Allowing arbitrary 
between-ASO communications in Comet could lead to 
abuse. For example, suppose a malicious ASO stored 
under one key copies itself to a large number of other 
keys slowly over time, and then simultaneously all ASOs 
initiate connections to a victim ASO stored under some 
target key. Such an attack allows an attacker to am- 
plify her resources: the attacker invests minimal effort 
to seed the original malicious ASO, yet the ultimate at- 
tack DDoSes nodes hosting the target key. Comet takes 
a Draconian approach toward protecting against such at- 
tacks: the ASO API only allows ASOs to communicate 
if they are stored under the same key, whether co-located 
on the same Comet node or on another node within the 
DHT. Our system further rate-limits communications per- 
formed by a particular ASO. Each Comet node allots a 
limited number of network communications per time pe- 
riod for every ASO it hosts. Though we have not ex- 
perimentally ascertained appropriate rate-limiting param- 
eters, the applications we present could all work with ap- 
proximately the same number of network operations as is 
required for a value in the current Vuze DHT: about 20
every timer interval.
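
A per-ASO communication allowance of this kind could be kept with a simple counter that is reset on every timer interval. The sketch below is an assumption about how such bookkeeping might look, written in Python with hypothetical names; the limit of 20 operations per interval is the approximate figure mentioned above, not an experimentally tuned parameter.

```python
class CommRateLimiter:
    """Per-node bookkeeping of network operations issued by each hosted ASO."""

    def __init__(self, max_ops_per_interval=20):
        self.max_ops = max_ops_per_interval
        self.counts = {}   # ASO key -> operations used in the current interval

    def allow(self, aso_key):
        """Return True and record the operation if the ASO still has allowance."""
        used = self.counts.get(aso_key, 0)
        if used >= self.max_ops:
            return False
        self.counts[aso_key] = used + 1
        return True

    def on_timer_interval(self):
        """Called once per timer interval to hand out a fresh allowance."""
        self.counts.clear()


limiter = CommRateLimiter()
results = [limiter.allow("key-A") for _ in range(21)]
print(results.count(True), results[-1])
limiter.on_timer_interval()
print(limiter.allow("key-A"))
```

Because the counter is kept per ASO on each hosting node, a malicious object that copies itself widely still cannot exceed the per-interval allowance at any single host.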


8 Conclusions 


This paper described Comet, an active distributed key- 
value store. Comet enables clients to customize a dis- 
tributed storage system in application-specific ways us- 
ing Comet’s active storage objects. By supporting ASOs, 
Comet allows multiple applications with diverse require- 
ments to share a common storage system. We imple- 
mented Comet on the Vuze DHT using a severely re- 
stricted Lua language sandbox for handler programming. 
Our measurements and experience demonstrate that a 
broad range of behaviors and customizations are possi- 
ble in a safe, but active, storage environment. 


9 Acknowledgements 


This work was supported in part by the National Science 
Foundation under grants NSF-0627367, NSF-0614975, 
NSF-0619836, NSF-0722004, and NSF-0963754, by the 
Google Fellowship in Cloud Computing, and by the 
Wissner-Slivka Chair. We thank Paul Gardner for his 
support on Vuze, and David Wetherall and our shepherd 
Wilson Hsieh for their helpful feedback on the paper. 


References 


[1] A. Acharya, M. Uysal, and J. Saltz. Active disks: Programming
model, algorithms and evaluation. In Proc. of the 8th Conference
on Architectural Support for Programming Languages and Operating
Systems, October 1998.


[2] A. Adya, W. Bolosky, M. Castro, G. Cermak, R. Chaiken, 
J. Douceur, J. Howell, J. Lorch, M. Theimer, and R. Wattenhofer. 
Farsite: Federated, available, and reliable storage for an incom- 
pletely trusted environment. In Proc. of OSDI, 2002. 


[3] Amazon S3. http://aws.amazon.com/s3/. 


[4] T. Anderson, L. Peterson, S. Shenker, and J. Turner. Overcoming
the Internet impasse through virtualization. IEEE Computer,
38(4), April 2005.


[5] Apache Cassandra. http://cassandra.apache.org/. 


[6] B. Bershad, S. Savage, P. Pardyak, E. G. Sirer, D. Becker, M. Fiuczynski,
C. Chambers, and S. Eggers. Extensibility, safety and
performance in the SPIN operating system. In Proc. of the 15th
ACM Symp. on Operating Systems Principles, December 1995.


[7] B. N. Bershad and C. B. Pinkerton. Watchdogs - extending the
UNIX file system. Computer Systems, 1(2), 1988. 


[8] A. R. Bharambe, M. Agrawal, and S. Seshan. Mercury: Support- 
ing scalable multi-attribute range queries. In Proc. of SIGCOMM, 
2004. 





[9] N. Borisov. Computational puzzles as Sybil defenses. In Proc. of 
the Intl. Conference on Peer-to-Peer Computing, 2006. 


[10] N. Borisov. Computational puzzles as Sybil defenses. In Proc. of 
the Intl. Conference on Peer-to-Peer Computing, 2006. 


[11] M. Castro, P. Druschel, A. Ganesh, A. Rowstron, and D. S. Wal- 
lach. Secure routing for structured peer-to-peer overlay networks. 
SIGOPS Oper. Syst. Rev., 2002. 


[12] D. R. Choffnes and F. E. Bustamante. Taming the Torrent: A 
practical approach to reducing cross-ISP traffic in P2P systems. 
In Proc. of SIGCOMM, 2008. 


[13] F. Dabek, R. Cox, F. Kaashoek, and R. Morris. Vivaldi: a de- 
centralized network coordinate system. In Proc. of SIGCOMM, 
2004. 





[14] F. Dabek, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica.
Wide-area cooperative storage with CFS. In Proc. of SOSP, 2001.

[15] F. Dabek, J. Li, E. Sit, J. Robertson, M. F. Kaashoek, and R. Morris.
Designing a DHT for low latency and high throughput. In NSDI,
2004.

[16] G. Danezis, C. Lesniewski-Laas, M. F. Kaashoek, and R. J. Anderson.
Sybil-resistant DHT routing. In ESORICS, 2005.

[17] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman,
A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels.
Dynamo: Amazon’s highly available key-value store. In Proc. of
SOSP, 2007.

[18] J. Dinger and H. Hartenstein. Defending the Sybil Attack in
P2P Networks: Taxonomy, Challenges, and a Proposal for Self-Registration.
In Intl. Conf. on Availability, Reliability and Security,
2006.

[19] J. R. Douceur. The Sybil attack. In Proc. of IPTPS, 2002.

[20] J. Falkner, M. Piatek, J. P. John, A. Krishnamurthy, and T. Anderson.
Profiling a million user DHT. In Proc. of IMC, 2007.

[21] M. J. Freedman, E. Freudenthal, and D. Mazières. Democratizing
content publication with Coral. In NSDI, pages 239-252, 2004.

[22] M. J. Freedman, K. Lakshminarayanan, S. Rhea, and I. Stoica.
Non-transitive connectivity and DHTs. In WORLDS’05, pages
10-10, Berkeley, CA, USA, 2005. USENIX Association.

[23] P. Gardner. Personal communication, 2009.

[24] R. Geambasu, T. Kohno, A. Krishnamurthy, A. Levy, H. M. Levy,
P. Gardner, and V. Mascaritolo. Cascade: A compositional approach
to self-destructing data. In Preparation, 2010.

[25] R. Geambasu, T. Kohno, A. Levy, and H. Levy. Vanish: Increasing
data privacy with self-destructing data. In Proc. of the USENIX
Security Symposium, August 2009.

[26] K. Hildrum and J. Kubiatowicz. Asymptotically Efficient Approaches
to Fault-Tolerance in Peer-to-peer Networks. In Proc.
of International Symposium on Distributed Computing, 2004.

[27] T. Isdal, M. Piatek, A. Krishnamurthy, and T. Anderson. Privacy-preserving
P2P data sharing with OneSwarm. In Proc. of SIGCOMM,
2010.

[28] M. F. Kaashoek, D. R. Engler, G. R. Ganger, H. M. Briceno,
R. Hunt, D. Mazières, T. Pinckney, R. Grimm, J. Jannotti, and
K. Mackenzie. Application performance and flexibility in exokernel
systems. In Proc. of SOSP, 1997.

[29] K. Keeton, D. Patterson, and J. Hellerstein. A case for intelligent
disks (IDISKs). ACM SIGMOD Record, 27(3), August 1998.

[30] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek.
The click modular router. In Proc. of the 17th ACM Symp. on
Operating Systems Principles, December 1999.

[31] C. Lesniewski-Laas and M. F. Kaashoek. Whanaungatanga:
Sybil-proof distributed hash table. In Proc. of NSDI, 2010.

[32] N. A. Lynch, D. Malkhi, and D. Ratajczak. Atomic Data Access
in Distributed Hash Tables. In Proc. of IPTPS, 2001.

[33] H. V. Madhyastha, T. Isdal, M. Piatek, C. Dixon, T. Anderson,
A. Krishnamurthy, and A. Venkataramani. iPlane: An Information
Plane for Distributed Services. In OSDI, 2006.

[34] A. Muthitacharoen, S. Gilbert, and R. Morris. Etna: A fault-tolerant
algorithm for atomic mutable DHT data. Technical Report
MIT-LCS-TR-993, MIT, June 2005.

[35] MySQL Database Triggers. http://dev.mysql.com/doc/refman/5.0/en/triggers.html.

[36] L. Peterson, A. Bavier, M. Fiuczynski, and S. Muir. Experiences
implementing PlanetLab. In Proc. of OSDI, 2006.

[37] M. Piatek, T. Isdal, A. Krishnamurthy, and T. Anderson. Do incentives
build robustness in BitTorrent? In NSDI, 2007.

[38] Project Voldemort. http://project-voldemort.com/.

[39] V. Ramasubramanian and E. G. Sirer. Beehive: O(1) lookup performance
for power-law query distributions in peer-to-peer overlays.
In Proc. of NSDI, Berkeley, CA, USA, 2004. USENIX Association.


[40] S. Rhea, B. Godfrey, B. Karp, J. Kubiatowicz, S. Ratnasamy,
S. Shenker, I. Stoica, and H. Yu. OpenDHT: A public DHT service
and its uses. In Proc. of SIGCOMM, 2005.

[41] E. Riedel, G. Gibson, and C. Faloutsos. Active storage for large-scale
data mining and multimedia. In Proc. of 24th International
Conference on Very Large Databases, August 1998.

[42] T. Ristenpart, E. Tromer, H. Shacham, and S. Savage. Hey, you,
get off of my cloud: exploring information leakage in third-party
compute clouds. In Proc. of CCS, 2009.

[43] R. Ierusalimschy, L. H. de Figueiredo, and W. Celes Filho. Lua
- an extensible extension language. Software: Practice and Experience,
26(6):635-652, 1999.

[44] A. Rowstron and P. Druschel. Storage management and caching
in PAST, a large-scale, persistent peer-to-peer storage utility. In
Proc. of SOSP, 2001.

[45] A. I. T. Rowstron, A.-M. Kermarrec, M. Castro, and P. Druschel.
Scribe: The design of a large-scale event notification infrastructure.
In Proc. of the Third International COST264 Workshop on
Networked Group Communication, 2001.

[46] M. Seltzer, Y. Endo, C. Small, and K. Smith. Dealing With Disaster:
Surviving Misbehaved Kernel Extensions. In OSDI, 1996.

[47] A. Singh, T.-W. Ngan, P. Druschel, and D. Wallach. Eclipse
attacks on overlay networks: Threats and defenses. In INFOCOM,
2006.

[48] E. Sit and R. Morris. Security considerations for peer-to-peer distributed
hash tables. In Proc. of IPTPS, 2002.

[49] M. Steiner and E. W. Biersack. Crawling AZUREUS. Technical
report, Institut Eurecom, Networking and Security Department,
2008.

[50] M. Steiner, E. W. Biersack, and T. En-Najjary. Actively monitoring
peers in KAD. In Proc. of IPTPS, 2007.

[51] M. Steiner, T. En-Najjary, and E. W. Biersack. A Global View of
KAD. In Proc. of IMC, 2007.

[52] I. Stoica, D. Adkins, S. Zhuang, S. Shenker, and S. Surana. Internet
indirection infrastructure. In Proc. of SIGCOMM, 2002.

[53] D. Stutzbach and R. Rejaie. Understanding Churn in Peer-to-Peer
Networks. In Proc. of IMC, 2006.

[54] D. L. Tennenhouse and D. J. Wetherall. Towards an active network
architecture. ACM SIGCOMM Computer Communications
Review, April 1996.

[55] G. Urdaneta, G. Pierre, and M. V. Steen. A Survey of DHT Security
Techniques (to appear). ACM Computing Surveys, 2010.

[56] uTorrent. http://www.utorrent.com.

[57] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser.
Active messages: a mechanism for integrated communication and
computation. In Proc. of ISCA, 1992.

[58] Vuze, Inc. http://www.vuze.com.

[59] P. Wang, I. Osipkov, N. Hopper, and Y. Kim. Myrmic: Secure and
Robust DHT Routing. Technical report, University of Minnesota,
2007.

[60] D. Wetherall. Active network vision and reality: Lessons from a
capsule-based system. In Proc. of the 17th ACM Symp. on Operating
Systems Principles, December 1999.

[61] S. Wolchok, O. S. Hofmann, E. W. Felten, J. A. Halderman, C. J.
Rossbach, B. Waters, and E. Witchel. Defeating Vanish with low-cost
Sybil attacks against large DHTs. In Proc. of NDSS, 2010.

[62] H. Xie, R. Yang, A. Krishnamurthy, Y. Liu, and A. Silberschatz.
P4P: Provider portal for P2P applications. In Proc. of SIGCOMM,
2008.

[63] H. Yu, M. Kaminsky, P. B. Gibbons, and A. D. Flaxman. SybilGuard:
defending against sybil attacks via social networks. In Proc.
of SIGCOMM, 2006.

[64] S. Q. Zhuang, B. Y. Zhao, A. D. Joseph, R. H. Katz, and J. D. Kubiatowicz.
Bayeux: an architecture for scalable and fault-tolerant
wide-area data dissemination. In Proc. of NOSSDAV, 2001.


USENIX Association 




SPORC: Group Collaboration using Untrusted Cloud Resources 
Abstract


Cloud-based services are an attractive deployment 
model for user-facing applications like word processing 
and calendaring. Unlike desktop applications, cloud ser- 
vices allow multiple users to edit shared state concurrently 
and in real-time, while being scalable, highly available, 
and globally accessible. Unfortunately, these benefits 
come at the cost of fully trusting cloud providers with 
potentially sensitive and important data. 

To overcome this strict tradeoff, we present SPORC, a 
generic framework for building a wide variety of collabo- 
rative applications with untrusted servers. In SPORC, a 
server observes only encrypted data and cannot deviate 
from correct execution without being detected. SPORC 
allows concurrent, low-latency editing of shared state, 
permits disconnected operation, and supports dynamic 
access control even in the presence of concurrency. We 
demonstrate SPORC’s flexibility through two prototype 
applications: a causally-consistent key-value store and a 
browser-based collaborative text editor. 

Conceptually, SPORC illustrates the complementary 
benefits of operational transformation (OT) and fork* 
consistency. The former allows SPORC clients to execute 
concurrent operations without locking and to resolve any 
resulting conflicts automatically. The latter prevents a 
misbehaving server from equivocating about the order of 
operations unless it is willing to fork clients into disjoint 
sets. Notably, unlike previous systems, SPORC can auto- 
matically recover from such malicious forks by leveraging 
OT’s conflict resolution mechanism. 
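
To give a concrete sense of how operational transformation resolves conflicts without locking, the following sketch transforms two concurrent text insertions so that both orders of application converge. It is a textbook-style illustration in Python, not SPORC's implementation: the Insert type, the site tie-breaker, and the function names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int    # index in the document at which text is inserted
    text: str
    site: int   # client identifier, used only to break ties (hypothetical)


def apply_op(doc, op):
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op, against):
    """Rewrite op so that it can be applied after `against` has been applied."""
    goes_first = against.pos < op.pos or (
        against.pos == op.pos and against.site < op.site
    )
    if goes_first:
        return Insert(op.pos + len(against.text), op.text, op.site)
    return op


base = "abc"
a = Insert(pos=1, text="X", site=1)   # issued concurrently by client 1
b = Insert(pos=2, text="Y", site=2)   # issued concurrently by client 2

# Client 1 applies a, then b transformed against a; client 2 does the reverse.
view_1 = apply_op(apply_op(base, a), transform(b, a))
view_2 = apply_op(apply_op(base, b), transform(a, b))
print(view_1, view_2)
```

Both clients end with the same document even though they applied the operations in different orders, which is the convergence property that lets clients edit concurrently and merge automatically.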


1 Introduction 


An emerging class of cloud-based collaborative services, 
such as online document processing and calendaring, pro- 
vides users with anywhere-available, real-time, and con- 
current access to shared state. Their deployments on man- 
aged cloud platforms enjoy global accessibility, high avail- 
ability, fault tolerance, and elastic resource allocation and 
scaling. Yet these benefits have come at the cost of having 
a fully trusted server, creating a risk of privacy problems 
due to server-side information leaks. The history of such 
services is one rife with unplanned data disclosures and 
malicious break-ins [24]. Indeed, the very centralization 
of information makes cloud providers high value targets 
for attack. Further, the behavior of service providers
themselves is a source of users’ privacy angst, as privacy poli-
cies may be weakened due to market expediencies. Finally, 
cloud providers face pressure from government agencies 
world-wide to release information on demand [15]. 


This paper challenges the belief that applications must 
sacrifice strong security and privacy to enjoy the bene- 
fits of cloud deployment. We present a system, SPORC, 
that offers managed cloud-based deployment for group 
collaboration services, yet does not require users to trust the
cloud provider to maintain data privacy or even to operate
correctly. SPORC’s cloud servers see only encrypted
data, and clients will detect any deviation from correct 
operation (e.g., adding, modifying, dropping, or reorder- 
ing operations) and will recover from the error. Much 
like SUNDR [24], SPORC bases its security and privacy 
guarantees on the security of users’ cryptographic keys, 
and not on the cloud provider’s good intentions nor on 
some threshold-like protocol between servers [9] that is 
susceptible to administrative or software attacks. 


SPORC provides a generic collaboration service in 
which users can create a document, modify its access con- 
trol list, edit it concurrently, experience fully automated 
merging of updates, and even perform these operations 
while disconnected. The SPORC framework supports a 
broad range of collaborative applications. Data updates 
are encrypted before being sent to a cloud-hosted server. 
The server assigns a total order to all operations and re- 
distributes the ordered updates to clients. If a malicious 
server drops or reorders updates, the SPORC clients can 
detect the server’s misbehavior, switch to a new server, 
restore a consistent state, and continue. The same mech- 
anism that allows SPORC to merge correct concurrent 
operations also enables it to transparently recover from 
attacks that fork clients’ views. 

From a conceptual distributed systems perspective, 
SPORC demonstrates the benefit of combining opera- 
tional transformation [11] and fork* consistency proto- 
cols [23]. Operational transformation (OT) defines a 
framework for executing lock-free concurrent operations 
that both preserves causal consistency and converges to a 
common shared state. It does so by transforming opera- 
tions so they can be applied commutatively by different 
clients, resulting in the same final state. While OT origi- 
nated with decentralized applications using pairwise rec- 
onciliation [11, 18], recent systems like Google Wave [44] 
have used OT with a trusted central server that orders and 
transforms clients’ operations. Fork* consistency, on the 
other hand, was introduced as a consistency model for
interacting with an untrusted server: If the server causes 
the views of two clients to diverge, the clients must either 
never see each others’ subsequent updates or else identify 
the server as faulty. 

Recovering from a malicious fork is similar to recon- 
ciling concurrent operations in the OT framework. Upon 
detecting a fork, SPORC clients use OT mechanisms to 
replay and transform forked operations, restoring a consis- 
tent state. Previous applications of fork* consistency [23] 
could only detect forks, but not resolve them. 

This paper makes the following contributions: 


§2 We identify and explore the conceptual connection 
between operational transformation protocols and the 
fork* consistency model, and use this connection to 
motivate SPORC’s design. 


§3 We describe SPORC’s framework and protocols for 
real-time collaboration. SPORC provides security 
and privacy against both an untrusted server that me- 
diates communication and other clients that lack ac- 
cess control permissions. 


§4 We demonstrate how to support dynamic access control,
which is challenging because SPORC supports
concurrent operations and offline editing. 


§5 We describe how clients can detect and recover
from maliciously-instigated forks. We also present a 
checkpoint mechanism that reduces saved client state 
and minimizes the join overhead for new clients. 


§6 We illustrate the extensibility of SPORC’s pluggable
data model by building both a key-value store and a 
browser-based collaborative text editor. We imple- 
ment these services as both stand-alone applications 
and web services; the latter run in a browser, execute 
in JavaScript (compiled from Java via GWT [12]), 
and require no prior installation. 


We evaluate SPORC’s performance in Section 7 before 
discussing related work and concluding. 


2 System Model 


The purpose of SPORC is to allow a group of users who 
trust each other to collaboratively edit some shared state, 
which we call the document, with the help of an untrusted 
server. SPORC is comprised of a set of client devices 
that modify the document on behalf of particular users, 
and a potentially-malicious server whose main role is to 
impose a global order on those modifications. The server 
receives updates from individual clients, orders them, and 
then broadcasts them to the other clients. Access to the 
document is limited to a set of authorized users, but each 
user may be logged into arbitrarily many clients simul- 
taneously (e.g., her desktop, laptop, and mobile phone). 
Each client, even if it is controlled by the same user as
another client, has its own local view of the document that 
must be synchronized with all other clients. 


2.1 Goals 


We designed SPORC with the following goals in mind: 

Flexible framework for a broad class of collabora- 
tive services. Because SPORC uses an untrusted server 
which does not see application-level content, the server is 
generic and can handle a broad class of applications. On 
the client side, SPORC provides a library suitable for use 
by a range of desktop and web-based applications. 

Propagate modifications quickly. When a client is 
connected to the network, its changes to the shared state 
should propagate quickly to all other clients so that clients’ 
views are nearly identical. This property makes SPORC 
suitable for building collaborative applications requiring 
nearly real-time updates, such as collaborative text editing 
and instant messaging. 

Tolerate slow or disconnected networks. To allow 
clients to edit the document while offline or while experi- 
encing high network latency, clients in SPORC update the 
document optimistically. Every time a client generates a 
modification, the client applies it immediately to its local 
state, and only later sends it to the server for redistribu- 
tion. As a result, clients’ local views of the document will 
invariably diverge, and SPORC must be able to resolve 
these divergences automatically. 

Keep data confidential from the server and unau- 
thorized users. Since the server is untrusted, document 
updates must be encrypted before being sent to the server. 
For efficiency, the system should use symmetric-key en- 
cryption. SPORC must provide a way to distribute this 
symmetric key to every client of authorized users. When 
a document’s access control list changes, SPORC must 
ensure that newly added users can decrypt the entire docu- 
ment, and that removed users cannot decrypt any updates 
subsequent to their expulsion. 

Detect a misbehaving server. Even without access to 
document plaintext, a malicious server could still do signif- 
icant damage by deviating from its assigned role. It could 
attempt to add, drop, alter, or delay clients’ (encrypted) 
updates, or it could show different clients inconsistent 
views of the document. SPORC must give clients a means 
to quickly detect these kinds of misbehavior. 

Recover from malicious server behavior. If clients 
detect that the server is misbehaving, clients should be 
able to fail over to a new server and resume execution.
Since a malicious server could cause clients to have incon- 
sistent local state, SPORC must provide a mechanism for 
automatically resolving these inconsistencies. 

To achieve these goals, SPORC builds on two concep- 
tual frameworks: operational transformation and fork* 
consistency. 

2.2 Operational Transformation


Operational Transformation (OT) [11] provides a general 
model for synchronizing shared state, while allowing each 
client to apply local updates optimistically. In OT, the 
application defines a set of operations from which all 
modifications to the document are constructed. When 
clients generate new operations, they apply them locally 
before sending them to others. To deal with the conflicts 
that these optimistic updates inevitably incur, each client 
transforms the operations it receives from others before 
applying them to its local state. If all clients transform 
incoming operations appropriately, OT guarantees that 
they will eventually converge to a consistent state. 

Central to OT is an application-specific transformation
function $T(\cdot)$ that allows two clients whose states have
diverged by a single pair of conflicting operations to return
to a consistent, reasonable state. $T(op_1, op_2)$ takes
two conflicting operations as input and returns a pair of
transformed operations $(op_1', op_2')$, such that if the party
that initially did $op_1$ now applies $op_2'$, and the party that
did $op_2$ now applies $op_1'$, the conflict will be resolved.

To use the example from Nichols et al. [30], sup- 
pose Alice and Bob both begin with the same local state 
“ABCDE”, and then Alice applies $op_1$ = ‘del 4’ locally
to get “ABCE”, while Bob performs $op_2$ = ‘del 2’ to
get “ACDE”. If Alice and Bob exchanged operations and
executed each other’s operations naively, then they would end up
in inconsistent states (Alice would get “ACE” and Bob 
“ACD”). To avoid this problem, the application supplies the 
following transformation function that adjusts the offsets 
of concurrent delete operations: 


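\[
T(\text{del } x, \text{del } y) =
\begin{cases}
(\text{del } x-1,\ \text{del } y) & \text{if } x > y \\
(\text{del } x,\ \text{del } y-1) & \text{if } x < y \\
(\text{no-op},\ \text{no-op}) & \text{if } x = y
\end{cases}
\]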


Thus, after computing $T(op_1, op_2)$, Alice will apply
$op_2'$ = ‘del 2’ as before but Bob will apply $op_1'$ = ‘del 3’,
leaving both in the consistent state “ACE”.
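
The delete example can be checked mechanically. The following is a minimal Python sketch, not code from SPORC itself: the function names `transform` and `apply_delete`, the use of integers for ‘del’ offsets, and the use of `None` for a no-op are all assumptions made for illustration.

```python
# Illustrative sketch of the delete/delete transformation described above.
# Offsets are 1-based, as in the "ABCDE" example; None stands for a no-op.

def transform(x, y):
    """T(del x, del y) -> (del x', del y') for two concurrent deletes."""
    if x > y:
        return x - 1, y
    if x < y:
        return x, y - 1
    return None, None  # both parties deleted the same character


def apply_delete(text, pos):
    """Delete the character at 1-based offset pos; None leaves text unchanged."""
    if pos is None:
        return text
    return text[:pos - 1] + text[pos:]


start = "ABCDE"
op1, op2 = 4, 2                       # Alice: 'del 4', Bob: 'del 2'
alice = apply_delete(start, op1)      # Alice's optimistic local state
bob = apply_delete(start, op2)        # Bob's optimistic local state

op1_t, op2_t = transform(op1, op2)    # transformed pair (op1', op2')
alice = apply_delete(alice, op2_t)    # the party that did op1 applies op2'
bob = apply_delete(bob, op1_t)        # the party that did op2 applies op1'
```

After the exchange, both `alice` and `bob` hold the same string, which is the convergence property the transformation function is meant to provide.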

Given this pair-wise transformation function, clients 
that diverge in arbitrarily many operations can return to a 
consistent state by applying the transformation function 
repeatedly. For example, suppose that Alice has optimistically
applied $op_1$ and $op_2$ to her local state, but has yet
to send them to other clients. If she receives a new operation
$op_{new}$, Alice must transform it with respect to
both $op_1$ and $op_2$: She first computes $(op_{new}', op_1') \leftarrow
T(op_{new}, op_1)$, and then $(op_{new}'', op_2') \leftarrow T(op_{new}', op_2)$.
This process yields $op_{new}''$, an operation that Alice has
“transformed past” her two local operations and can now
apply to her local state.

Throughout this paper, we use the notation $op' \leftarrow
T(op, (op_1, \ldots, op_n))$ to denote transforming $op$ past a
sequence of operations $(op_1, \ldots, op_n)$ by iteratively applying
the transformation function.¹ Similarly, we define
$(op_1', \ldots, op_n') \leftarrow T((op_1, \ldots, op_n), op)$ to represent
transforming a sequence of operations past a single
operation.
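
Transforming past a sequence is just a loop over the pairwise function. The sketch below illustrates this with the single-character delete operations from the earlier example. The helper names (`transform_past`, `transform_seq_past`) and the `None` no-op convention are hypothetical, chosen for illustration rather than taken from SPORC.

```python
def transform(a, b):
    """Pairwise T for 1-based single-character deletes; None is a no-op."""
    if a is None or b is None:
        return a, b
    if a > b:
        return a - 1, b
    if a < b:
        return a, b - 1
    return None, None


def transform_past(op, seq):
    """op' <- T(op, (op_1..op_n)); also returns the sequence transformed past op."""
    transformed_seq = []
    for other in seq:
        op, other = transform(op, other)
        transformed_seq.append(other)
    return op, transformed_seq


def transform_seq_past(seq, op):
    """(op_1'..op_n') <- T((op_1..op_n), op)."""
    return transform_past(op, seq)[1]


def apply_delete(text, pos):
    return text if pos is None else text[:pos - 1] + text[pos:]


# Alice has applied two local deletes to "ABCDE" and then receives 'del 2'.
local_ops = [4, 1]
alice = "ABCDE"
for op in local_ops:
    alice = apply_delete(alice, op)

op_new = 2
op_new_t, local_ops_t = transform_past(op_new, local_ops)
alice = apply_delete(alice, op_new_t)

# A replica that applied op_new first, then Alice's transformed operations.
other = apply_delete("ABCDE", op_new)
for op in local_ops_t:
    other = apply_delete(other, op)
```

Both replicas end in the same state, even though they applied the three operations in different orders.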

Operational transformation can be applied in a wide 
variety of settings, as operations, and the transforms on 
them, can be tailored to each application’s requirements. 
For a collaborative text editor, operations may contain 
inserts and deletes of character ranges at specific cursor 
offsets, while for a causally-consistent key-value store, 
operations may contain lists of keys to update or remove. 
In fact, we have implemented both such systems on top of 
SPORC, which we describe further in Section 6. 

For many applications, with a carefully-chosen trans- 
formation function, OT is able to automatically return 
divergent clients to a state that is not only consistent, but 
semantically reasonable as well. But for some applica- 
tions, such as source-code version control, semantic con- 
flicts must be resolved manually. OT can support such 
applications through the choice of a transformation func- 
tion that does not try to resolve the conflict, but instead 
inserts an explicit conflict marker into the history of oper- 
ations. A human can later examine the marker and resolve 
the conflict by issuing new writes. These write operations 
will supersede the conflicting operations, provided that the
system preserves the global order of committed operations 
and the partial order of each client’s operations. Section 3 
describes how SPORC provides these properties. 

While OT was originally proposed for decentralized n- 
way synchronization between clients, many prominent OT 
implementations are server-centric, including Jupiter [30] 
and Google Wave [44]. They rely on the server to resolve 
conflicts and to maintain consistency, and are architec- 
turally better suited for web services. On the flip side, a 
misbehaving server can compromise the confidentiality, 
integrity, and consistency of the shared state. 

Later, we describe how SPORC adapts these server- 
based OT architectures to provide security against a mis- 
behaving server. At a high level, SPORC has each client 
simulate the transformations that would have been applied 
by a trusted OT server, using the server only for ordering. 
But we still need to protect against inconsistent orderings, 
for which we leverage fork* consistency techniques [23]. 


2.3 Fork* Consistency


To prevent a malicious server from forging or modifying 
clients’ operations, clients in SPORC digitally sign all 
their operations with their user’s private key. This is not 
sufficient for correctness, however: a misbehaving server 
could still equivocate and present different clients with 
divergent views of the history of operations. 


¹Strictly speaking, $T$ always returns a pair of operations. For simplicity,
however, we sometimes write $T$ as returning a single operation,
especially when the other is unchanged, as in our “delete char” example.
To defend against server equivocation, SPORC clients
enforce fork* consistency [23].² In fork*-consistent systems,
clients share information about their individual
views of the history by embedding it in every operation 
they send. As a result, if clients to whom the server has 
equivocated ever communicate, they will discover the 
server’s misbehavior. The server can still divide its clients 
into disjoint groups and only tell each client about oper- 
ations by others in its group. But, once the server has 
forked two groups in this way, it cannot tell a member 
of one group about an operation submitted by another 
group’s members without risking detection. 

As in BFT2F [23], each SPORC client enforces fork* 
consistency by maintaining a hash chain over its view of 
the committed history. In this context, a hash chain is a 
method of incrementally computing the hash of a list of 
elements. More specifically, if $op_1, \ldots, op_n$ are the operations
in the history, $h_0$ is a constant initial value, and $h_i$ is
the value of the hash chain over the history up to $op_i$, then
$h_i = H(h_{i-1} \,\|\, H(op_i))$, where $H(\cdot)$ is a cryptographic
hash function and $\|$ denotes concatenation. When a client
with history up to $op_n$ submits a new operation, it includes
$h_n$ in its message. On receiving the operation, another
client can check whether the included $h_n$ matches its own
hash chain computation over its local history up to $op_n$.
If they do not match, the client knows that the server has 
equivocated. 
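
The hash chain recurrence can be sketched in a few lines of Python. This is an illustration only: the choice of SHA-256, the all-zero initial value, and the helper names `extend`, `chain`, and `matches_prev_hc` are assumptions, not details given in the text.

```python
import hashlib

H0 = b"\x00" * 32  # assumed constant initial value h_0


def H(data: bytes) -> bytes:
    """Cryptographic hash function (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(data).digest()


def extend(h_prev: bytes, op: bytes) -> bytes:
    """h_i = H(h_{i-1} || H(op_i))."""
    return H(h_prev + H(op))


def chain(ops) -> bytes:
    """Hash chain value over an entire history of serialized operations."""
    h = H0
    for op in ops:
        h = extend(h, op)
    return h


def matches_prev_hc(local_history, n: int, claimed: bytes) -> bool:
    """Compare a sender's claimed h_n with our own chain over op_1..op_n."""
    return chain(local_history[:n]) == claimed


sender_history = [b"op1", b"op2", b"op3"]
same_view = [b"op1", b"op2", b"op3", b"op4"]
forked_view = [b"op1", b"opX", b"op3"]

h_n = chain(sender_history)                    # value embedded by the sender
ok = matches_prev_hc(same_view, 3, h_n)        # identical prefix
forked = matches_prev_hc(forked_view, 3, h_n)  # server showed a different op_2
```

Because each value depends on every earlier operation, a single differing operation anywhere in the prefix changes all later chain values, which is why one comparison suffices to check the whole prefix.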


2.4 The Benefits of Having a Server 


SPORC uses a central untrusted server, but the server’s 
sole purpose is to order and store client-generated opera- 
tions. This limited role may lead one to ask whether the 
server should be removed, leading to a completely peer-to- 
peer design. Indeed, many group collaboration systems, 
such as Bayou [43] and Network Text Editor [17], employ 
decentralized architectures. Decentralized designs are a 
poor fit, however, for applications in which a user needs a 
timely notification that her operation has been committed 
and will not be overridden by another’s (not yet received) 
operation. For example, to schedule a meeting room, an 
online user should be able to quickly determine whether 
her reservation succeeded, without worrying if an offline 
client’s request will override hers. Yet this is difficult to 
achieve without waiting to hear from all (or at least a quo- 
rum of) other clients, which poses a problem when clients 
are regularly offline. In reaction, Bayou delegates commits
to a (statically) designated, trusted “primary” peer,
which is little different from having a server.


²Fork* consistency is a weaker variant of an earlier model called fork
consistency [27]. They differ in that under fork consistency, a pair of
clients only needs to exchange one message to detect server equivocation,
whereas under fork* consistency, they may need to exchange two. For
OT systems like ours, this distinction makes little difference because
clients constantly exchange small messages. On the other hand, fork*
consistency permits a one-round protocol to submit operations, rather
than two. Beyond efficiency, this also ensures that a crashed client cannot
prevent the system from making progress.

SPORC, on the other hand, only requires an untrusted 
server for globally ordering operations. Thus, it can lever- 
age the benefits of a cloud deployment—high availability 
and global accessibility—to achieve timely commits. We 
show in Section 4.2 how SPORC’s centralized server also 
helps support dynamic access control and key rotation, 
even in the face of concurrent membership changes. 


2.5 Deployment and Threat Model 


Deployment Assumptions. While most of the paper dis- 
cusses the SPORC protocol in terms of a single server and 
a single document, we assume that a cloud-based SPORC 
deployment would manage large numbers of users and 
documents by replicating functionality and partitioning 
state over many servers. Each document in SPORC can be 
managed independently, leading naturally to the shared- 
nothing architectures [36] already common to scalable 
cloud services. 

For a client to recover from a misbehaving server, we 
assume there exists some alternative (untrusted) server 
to switch to after a client detects faulty behavior. These 
backup servers may belong to the same or different admin- 
istrative domains as the original, depending upon the type 
of faults that a SPORC deployment expects to encounter. 

Note that even if malicious (Byzantine) behavior among 
cloud servers is not a primary concern, this strong threat 
model also covers weaker non-crash failures related to 
server misconfiguration, Heisenbugs, or “split-brain” par- 
titioned behavior. In all cases, failover and recovery is 
client driven. Crash failures, unlike Byzantine failures, 
would not result in forks and could be handled by tra- 
ditional fault-tolerance techniques (e.g., primary/backup 
replication) already employed in cloud services. 


Threat Model. SPORC makes the following assumptions:
Server: The server is potentially malicious, and a mis- 
behaving server may be able to prevent progress, but it 
must not be able to corrupt the clients’ shared state. A 
server may fork clients’ states, but only within the con- 
fines of the fork* consistency model. If clients are able to 
communicate either in-band or out-of-band, server equiv- 
ocation will be detected promptly by at least one client. 
The server may be able to learn which users and clients 
are sharing a document, but it must not learn what is in the 
document or even the contents of the individual operations 
that the clients submit. Since the server has access to the 
size and timing of clients’ operations, it may be able to 
glean some information about the document via traffic 
analysis. Traffic analysis is made more difficult by the 
fact that encrypted operations do not even reveal which 
portions of the shared state they modify. Nevertheless, the
complete mitigation of traffic analysis is beyond the scope
of this work, but it would likely involve padding the length
of operations and introducing cover traffic. 

To attack availability, the server may arbitrarily erase 
or refuse to return any of the encrypted data that it stores. 
To mitigate this threat, the encrypted data could be repli- 
cated on servers in other administrative domains. More- 
over, each client could replicate its own local state on 
cloud servers other than the main SPORC server. Notably, 
SPORC cannot guarantee recovery from every possible 
fork, unless every client stores every operation that it has 
seen either locally or remotely. 

Clients: If a client is logged in as a particular user, that 
client is trusted to exercise the privileges granted to that 
user (e.g., to see the state, modify it, or modify access 
privileges). Otherwise, clients are untrusted, and they 
should not be able to see the document, or to modify the 
document or its access control list, even if they collude 
with each other or with the server. 

User authentication and keys: We assume that each 
user has a secure public/private key pair, and that clients 
have a secure way to verify the public key of other users. 

Application code: We assume the presence of a code 
authentication infrastructure that can verify that the appli- 
cation code run by clients is genuine. This mechanism 
might rely on code signing or on HTTPS connections to a 
trusted server (different from the untrusted server used as 
part of SPORC’s protocols). 


3 System Design 


This section describes SPORC’s design in more detail, in- 
cluding its synchronization mechanisms and the measures 
that clients implement to detect a malicious server that 
may modify, reorder, duplicate, or drop operations. This 
section assumes that the set of users and clients editing a 
given document is fixed; we consider dynamic member- 
ship in Section 4. 


3.1 System Overview 


The parties and stages involved with SPORC operations 
are shown in Figure 1. At a high level, the local state of
a SPORC application is synchronized between multiple 
clients, using a server to collect updates from clients, order 
them, then redistribute the client updates to others. There 
are four types of state in the system. 


(1) The local state is a compact representation of the 
client’s current view of the document (e.g., the most recent 
version of a collaboratively edited text).


(2) The encrypted history is the set of operations stored 
at and ordered by the server. The payloads of operations 
that change the contents of the document are encrypted 
to preserve confidentiality. The server orders the opera- 
tions oblivious to their payloads but aware of the previous 
operations on which they causally depend. 
Figure 1: SPORC architecture and synchronization steps


(3) The committed history is the official set of (plaintext)
operations shared among all clients, as ordered by
the server. Clients derive this committed history from 
the server’s encrypted history by transforming operations’ 
payloads to reflect any changes that the server’s ordering 
might have caused. 


(4) A client’s pending queue is an ordered list of the 
client’s local operations that have already been applied 
to its local state, but that have yet to be committed (i.e., 
assigned a sequence number by the server and added to 
the client’s committed history). 


SPORC synchronizes clients’ local state for a partic- 
ular document using the following steps, also shown in 
Figure 1. This section restricts its consideration to interactions
with a static membership and a well-behaved server;
we relax these restrictions in the next two sections, re- 
spectively. The flow of local operations to the server is 
illustrated by dashed blue arrows; the flow of operations 
received from the server is shown by solid red arrows. 


1. A client application generates an operation, applies 
it to its local state immediately, and then places it at 
the end of the client’s pending queue. 


2. If the client does not currently have any operations 
under submission, it takes its oldest queued operation 
yet to be sent, op, assigns it a client sequence number 
(clntSeqNo), embeds in it the global sequence number 
of the last committed operation (prevSeqNo) along 
with the corresponding hash chain value (prevHC), 
encrypts its payload, digitally signs it, and transmits 
it to the server. (As an optimization, if the client 
has multiple operations in its pending queue, it can 
submit them as a single batched operation.) 


3. The server adds the client-submitted op to its en- 
crypted history, assigning it the next available global 
sequence number (seqNo). The server forwards op 
with this seqNo to all the clients participating in the 
document. 


4. Upon receiving an encrypted operation op, the client 
verifies its signature (V) and checks that its clntSeqNo,
seqNo, and prevHC fields have the expected
values. If these checks succeed, the client decrypts 
the payload (D) for further processing. If they fail, 
the client concludes that the server is malicious. 


5. Before adding op to its committed history, the client 
must transform it past any other operations that had 
been committed since op was generated (i.e., all those 
with global sequence numbers greater than op’s prev- 
SeqNo). Once op has been transformed, the client 
appends op to the end of the committed history. 


6. If the incoming operation op was one that the client 
had initially sent, the client dequeues the oldest ele- 
ment in the pending queue (which will be the uncom- 
mitted version of op) and prepares to send its next 
operation. Otherwise, the client transforms op past 
all its pending operations and, conversely, transforms 
those operations with respect to op. 


7. The client returns the transformed version of the in- 
coming operation op to the application. The applica- 
tion then applies op to its local state. 
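
The steps above can be sketched as a small simulation. This is a simplified illustration, not SPORC's implementation: it omits encryption, signatures, and the checks of step 4, models the server as a Python list, and uses single-character deletes as the only operation type. The `Client` class, its method names, and the message dictionary layout are all hypothetical.

```python
def transform(a, b):
    """Pairwise T for 1-based single-character deletes; None is a no-op."""
    if a is None or b is None:
        return a, b
    if a > b:
        return a - 1, b
    if a < b:
        return a, b - 1
    return None, None


def apply_delete(text, pos):
    return text if pos is None else text[:pos - 1] + text[pos:]


class Client:
    def __init__(self, name, doc):
        self.name = name
        self.local = doc       # local state
        self.committed = []    # committed history (transformed operations)
        self.pending = []      # pending queue

    def generate(self, pos):
        """Step 1: apply the operation locally at once, then enqueue it."""
        self.local = apply_delete(self.local, pos)
        self.pending.append(pos)

    def submit(self):
        """Step 2 (simplified): send the oldest pending op with its prevSeqNo."""
        return {"sender": self.name, "op": self.pending[0],
                "prevSeqNo": len(self.committed)}

    def receive(self, msg):
        """Steps 5-7 for one operation delivered in the server's order."""
        op = msg["op"]
        # Step 5: transform past operations committed after prevSeqNo.
        for earlier in self.committed[msg["prevSeqNo"]:]:
            op, _ = transform(op, earlier)
        self.committed.append(op)
        # Step 6: our own operation is now committed; nothing left to apply.
        if msg["sender"] == self.name:
            self.pending.pop(0)
            return
        # Otherwise transform op past the pending queue, and vice versa.
        rebased = []
        for mine in self.pending:
            op, mine = transform(op, mine)
            rebased.append(mine)
        self.pending = rebased
        # Step 7: hand the transformed operation to the application.
        self.local = apply_delete(self.local, op)


alice, bob = Client("alice", "ABCDE"), Client("bob", "ABCDE")
alice.generate(4)
bob.generate(2)
server_order = [alice.submit(), bob.submit()]  # step 3: server picks an order
for msg in server_order:
    for client in (alice, bob):
        client.receive(msg)
```

In this run both clients submit concurrently against an empty committed history, yet they finish with identical local states and identical committed histories.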


SPORC maintains the following invariants with respect to 
the system’s state: 


Local Coherence: A client’s local state is equivalent 
to the state it would be in if, starting with an initial empty 
document, it applied, in order, all of the operations in its 
committed history followed by all of the operations in its 
pending queue. 


Fork* Consistency: If the server is well behaved, all 
clients’ committed histories are linearizable (i.e., for every 
pair of clients, one client’s committed history is equal to 
or a prefix of the other client’s committed history). If the 
server is faulty, however, clients’ committed histories may 
be forked [23]. 


Client-Order Preservation: The order that a non- 
malicious server assigns to operations originating from 
a given client must be consistent with the order that the 
client assigned to those operations. 


3.2 Operations 


SPORC clients exchange two types of operations: docu- 
ment operations, which represent changes to the content 
of the document, and meta-operations, which represent 
changes to document metadata such as the document’s 
access control list. Meta-operations are sent to the server 
in the clear, but the payloads of document operations are 
encrypted under a symmetric key that is shared among 
all of the clients but is unknown to the server. (See Sec- 
tion 4.1 for a description of how this key is chosen and 
distributed.) In addition, every operation is labeled with 
the name of the user that created it and is digitally signed 
by that user’s private key. All operations also contain a 
unique client ID (clntID) that identifies from which of the
user’s client machines it came. 


3.3 The Server’s Limited Role


Because the SPORC server is untrusted, its role is limited 
to ordering and storing the operations that clients submit, 
most of which are encrypted. The server stores the opera- 
tions in its encrypted history so that new clients joining the 
document or existing clients that have been disconnected 
can request from the server the operations that they are 
missing. This storage function is not essential, however, 
and in principle it could be handled by a different party. 

Notably, since the server does not have access to the 
plaintext of document operations, the same generic server 
implementation can be used for any application that uses 
our protocol regardless of the kind of document being 
synchronized. 


3.4 Sequence Numbers and Hash Chains 


SPORC clients use sequence numbers and a hash chain to 
ensure that operations are properly serialized and that the 
server is well behaved. Every operation has two sequence 
numbers: a client sequence number (clntSeqNo) which is
assigned by the client that submitted the operation, and
a global sequence number (seqNo) which is assigned by
the server. On receiving an operation, a client verifies 
that the operation’s clntSeqNo is one greater than the last 
clntSeqNo seen from the submitting client, and that the op- 
eration’s seqNo is one greater than the last seqNo that the 
receiving client saw. These sequence number checks en- 
force the “client order preservation” invariant and ensure 
that there are no gaps in the sequence of operations. 
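
A receiving client's sequence number checks might look like the following sketch. The `SeqChecker` class and the assumption that both counters start at 1 are illustrative choices, not details specified in the text.

```python
class SeqChecker:
    """Tracks the last sequence numbers seen by one receiving client."""

    def __init__(self):
        self.last_seq_no = 0          # last global seqNo seen (assumed to start at 1)
        self.last_clnt_seq_no = {}    # clntID -> last clntSeqNo seen from it

    def check(self, clnt_id, clnt_seq_no, seq_no):
        """Return True and record the operation only if both checks pass."""
        expected_clnt = self.last_clnt_seq_no.get(clnt_id, 0) + 1
        if clnt_seq_no != expected_clnt:
            return False              # client order violated or an op was dropped
        if seq_no != self.last_seq_no + 1:
            return False              # gap or replay in the global order
        self.last_clnt_seq_no[clnt_id] = clnt_seq_no
        self.last_seq_no = seq_no
        return True


checker = SeqChecker()
first = checker.check("alice-laptop", 1, 1)
second = checker.check("bob-phone", 1, 2)
skipped = checker.check("alice-laptop", 3, 3)   # clntSeqNo 2 is missing
replayed = checker.check("bob-phone", 1, 3)     # clntSeqNo 1 seen before
third = checker.check("alice-laptop", 2, 3)
```

In the protocol as described, a failed check leads the client to conclude that the server is malicious; the sketch simply returns `False` and leaves its state unchanged.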

When a client uploads an operation $op_{new}$ to the server,
the client sets $op_{new}$’s prevSeqNo field to the global sequence
number of the last committed operation, $op_n$, that
the client knows about. The client also sets $op_{new}$’s prevHC
field to the value of the client’s hash chain over the
committed history up to $op_n$. A client who receives $op_{new}$
compares its prevHC with the client’s own hash chain computation
up to $op_n$. If they match, the recipient knows that
its committed history is identical to the sender’s committed
history up to $op_n$, thereby guaranteeing fork* consistency.

A misbehaving server cannot modify the prevSeqNo or 
prevHC fields, because they are covered by the submitting 
client’s signature on the operation. The server can try 
to tell two clients different global sequence numbers for 
the same operation, but this will cause the two clients’ 
histories—and hence their future hash chain values—to 
diverge, and it will eventually be detected. 

To simplify the design, each SPORC client has at most 
one operation “in flight” at any time: only the operation
at the head of a client’s pending queue can be sent to the
server. Among other benefits, this rule ensures that operations’
prevSeqNo and prevHC values will always refer
to operations that are in the committed history, and not
to other operations that are “in flight.” This restriction
could be relaxed, but only at considerable cost in complex- 
ity. For similar reasons, other OT-based systems such as 
Google Wave adopt the same rule [44]. 

Prohibiting more than one in-flight operation per client 
is less restrictive than it might seem, as operations can be 
combined or batched. Like Wave, SPORC includes an 
application-specific composition function, which consoli- 
dates two operations into one. This can be used iteratively 
to combine a sequence of operations into a single one. 
Further, it is straightforward to batch multiple operations 
into a single logical operation, which is then submitted as 
a unit. Because operations can be composed or batched, a 
client can empty its pending queue every time it gets an 
opportunity to submit an operation to the server. 


3.5 Resolving Conflicts with OT 


Once a client has validated an operation received from the 
server, the client must use OT to resolve the conflicts that 
may exist between the new operation and other operations 
in the committed history and pending queue. These con- 
flicts might have arisen for two reasons. First, the server 
may have committed additional operations since the new 
operation was generated. Second, the receiving client’s lo- 
cal state might reflect uncommitted operations that reside 
on the client’s pending queue but that other clients do not 
yet know about. 

Before a client appends an incoming operation $op_{new}$
to its committed history, it compares $op_{new}$’s prevSeqNo
value with the global sequence number of the last committed
operation. The prevSeqNo field indicates the last
committed operation that the submitting client knew about
when it uploaded $op_{new}$. Thus, if the values match, the
client knows that no additional operations have been added
to its committed history since $op_{new}$ was generated, and
the new operation can be appended directly to the committed
history. But if they do not match, then other operations
were committed since $op_{new}$ was sent, and $op_{new}$ needs
to be transformed past each of them. For example, if
$op_{new}$ has a prevSeqNo of 10, but was assigned global
sequence number 14 by the server, then the client must
compute $op'_{new} \leftarrow T(op_{new}, \langle op_{11}, op_{12}, op_{13} \rangle)$, where
$\langle op_{11}, op_{12}, op_{13} \rangle$ are the intervening committed
operations. Only then can the resulting transformed operation
$op'_{new}$ be appended to the committed history. After
appending the operation, the client updates the hash chain
computed over the committed history so that future incoming
operations can be validated.
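
To make the prevSeqNo 10 / sequence number 14 example concrete, the following Python sketch transforms an operation past three intervening committed operations. The transformation function in SPORC is application-specific; the insert-only Insert type, the author-based tie-break, and the function names here are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, replace
from typing import Iterable


@dataclass(frozen=True)
class Insert:
    pos: int      # index at which text is inserted
    text: str
    author: str   # used only to break ties between inserts at the same index


def transform(op: Insert, against: Insert) -> Insert:
    """Rewrite op (defined on the same state as `against`) to apply after it."""
    if against.pos < op.pos or (against.pos == op.pos and against.author < op.author):
        return replace(op, pos=op.pos + len(against.text))
    return op


def transform_past(op: Insert, committed_since: Iterable[Insert]) -> Insert:
    """Transform op past each operation committed after op's prevSeqNo."""
    for committed in committed_since:
        op = transform(op, committed)
    return op


def apply_op(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]


# State at sequence number 10, followed by committed operations 11 to 13.
doc = "abc"
intervening = [Insert(0, "X", "alice"), Insert(4, "Y", "alice"), Insert(1, "Z", "alice")]
for committed in intervening:
    doc = apply_op(doc, committed)

# op_new was generated against the state at sequence number 10 ("a!bc" intended).
op_new = Insert(1, "!", "bob")
op_new_prime = transform_past(op_new, intervening)
result = apply_op(doc, op_new_prime)
```

Here the transformed operation still places "!" between "a" and "b", even though three other insertions were committed in the meantime.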

At this point, if $op'_{new}$ is one of the receiving client’s
own operations that it had previously uploaded to the
server (or a transformed version of it), it will necessarily
match the operation at the head of the pending queue.
Since $op'_{new}$ has now been committed, its uncommitted
version can be retired from the pending queue, and the
next pending operation can be submitted to the server.
Furthermore, since the client has already optimistically
applied the operation to its local state even before sending
it to the server, the client does not need to apply $op'_{new}$
again, and nothing more needs to be done.

If $op'_{new}$ is not one of the client’s own operations,
however, the client must perform additional transformations in
order to reestablish the “local coherence” invariant, which
states that the client’s local state is equal to the in-order
application of its committed history followed by its pending
queue. First, in order to obtain a version of $op'_{new}$ that
it can apply to its local state, the client must transform
$op'_{new}$ past all of the operations in its pending queue. This
step is necessary because the pending queue contains
operations that the client has already applied locally, but have
not yet been committed and, therefore, were unknown to
the sender of $op'_{new}$.

Second, the client must transform the entire pending
queue with respect to $op'_{new}$ to account for the
fact that $op'_{new}$ was appended to the committed history.
More specifically, the client computes
$\langle op'_1, \ldots, op'_m \rangle \leftarrow T(\langle op_1, \ldots, op_m \rangle, op'_{new})$,
where $\langle op_1, \ldots, op_m \rangle$ is the
pending queue. This transformation has the effect of push- 
ing the pending queue forward by one operation to make 
room for the newly extended committed history. The op- 
erations on the pending queue need to stay ahead of the 
committed history because they will receive higher global 
sequence numbers than any of the currently committed 
operations. Furthermore, by transforming its unsent oper- 
ations in response to updates to the document, the client 
reduces the amount of transformation that other clients 
will need to do when they eventually receive its operations. 
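
The two transformations above (the incoming operation past the pending queue, and the pending queue with respect to the incoming operation) can be performed in a single pass. The Python sketch below shows one way to do so and checks the local coherence invariant on a small example. The insert-only Insert type, the tie-break rule, and the name integrate_remote are assumptions for illustration; SPORC leaves the transformation function to the application.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str
    author: str   # tie-break for inserts at the same index


def transform(op: Insert, against: Insert) -> Insert:
    """Rewrite op (defined on the same state as `against`) to apply after it."""
    if against.pos < op.pos or (against.pos == op.pos and against.author < op.author):
        return replace(op, pos=op.pos + len(against.text))
    return op


def apply_op(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]


def integrate_remote(op_new: Insert, pending: List[Insert]) -> Tuple[Insert, List[Insert]]:
    """Return (op_new rewritten for the local state, pending queue rewritten
    to follow op_new in the committed history)."""
    rebased = []
    for p in pending:
        p_rebased = transform(p, op_new)   # pending op now follows op_new
        op_new = transform(op_new, p)      # op_new now follows this pending op
        rebased.append(p_rebased)
    return op_new, rebased


def check_local_coherence(committed: str, pending: List[Insert], op_new: Insert) -> bool:
    local = committed
    for p in pending:
        local = apply_op(local, p)
    op_local, rebased = integrate_remote(op_new, pending)
    local = apply_op(local, op_local)              # update local state
    replayed = apply_op(committed, op_new)         # extend committed history
    for p in rebased:
        replayed = apply_op(replayed, p)
    return local == replayed
```

For example, with committed state "abc" and pending inserts of "d" and "e" at the end, an incoming insert of "X" at index 0 leaves both the local state and the replayed history at "Xabcde".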





4 Membership Management 


Document membership in SPORC is controlled at the 
level of users, each of which is associated with a public- 
private key pair. When a document is first created, only the 
user that created it has access. Subsequently, privileged 
users can change the document’s access control list (ACL) 
by submitting ModifyUserOp meta-operations, which
get added to the document’s history (covered by its hash 
chain), much like normal operations. 

A user can be given one of three privilege levels: 
reader, which entitles the user to decrypt the document but 
not to submit new operations; editor, which entitles the 
user to read the document and to submit new operations 
(except those that change the ACL); and administrator, 
which grants the user full access, including the ability 
to invite new users and remove existing users. Because 
ModifyUserOps are not encrypted, a non-malicious 
server will immediately reject operations from users with 
insufficient privileges. But because the server is untrusted,
every client maintains its own copy of the ACL, based
on the history’s ModifyUserOps, and refuses to apply
operations that came from unauthorized users. 
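
As a rough illustration of this client-side check, the Python sketch below replays a history, derives the ACL from its ModifyUserOps, and skips operations from users with insufficient privileges. The Op fields, the Privilege enum, and the convention that a ModifyUserOp with no level removes the user are assumptions, not SPORC’s actual data model.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, List, Optional, Tuple


class Privilege(IntEnum):
    READER = 1
    EDITOR = 2
    ADMIN = 3


@dataclass(frozen=True)
class Op:
    author: str
    target: Optional[str] = None        # set only for ModifyUserOps
    level: Optional[Privilege] = None   # None with a target means "remove user"

    @property
    def is_modify_user(self) -> bool:
        return self.target is not None


def authorized(acl: Dict[str, Privilege], op: Op) -> bool:
    """Editors may submit document operations; only administrators may change the ACL."""
    have = acl.get(op.author)
    if have is None:
        return False
    need = Privilege.ADMIN if op.is_modify_user else Privilege.EDITOR
    return have >= need


def replay(creator: str, history: List[Op]) -> Tuple[Dict[str, Privilege], List[Op]]:
    """Rebuild the ACL from the history, refusing unauthorized operations."""
    acl = {creator: Privilege.ADMIN}    # initially only the creator has access
    accepted = []
    for op in history:
        if not authorized(acl, op):
            continue                    # refuse to apply
        if op.is_modify_user:
            if op.level is None:
                acl.pop(op.target, None)
            else:
                acl[op.target] = op.level
        accepted.append(op)
    return acl, accepted
```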


4.1 Encrypting Document Operations 


To prevent eavesdropping by the server or unapproved 
users, the payloads of document operations are encrypted 
under a symmetric key known only to the document’s 
current members. More specifically, to create a new docu- 
ment, the creator generates a random AES key, encrypts it 
under her own public key, and then writes the encrypted 
key to the document’s initial create meta-operation. To add 
new users, an administrator submits a ModifyUserOp 
that includes the document’s AES key encrypted under 
each of the new users’ public keys. 

If users are removed, the AES key must be changed 
so that the removed users will not be able to decrypt sub- 
sequent operations. To do so, an administrator picks a 
new random AES key, encrypts it under the public keys 
of all the remaining participants, and then submits the 
encrypted keys as part of the ModifyUserOp.³ This
meta-operation also includes an encryption of the old AES 
key under the new AES key. This enables later users to 
learn earlier keys and thus decrypt old operations, without 
requiring the operations to be re-encrypted. 
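
The chain of old keys encrypted under new ones can be walked backwards by any user who holds the current key. The Python sketch below illustrates only that chaining logic. The toy_wrap function is a deliberately insecure placeholder standing in for AES encryption, and all names are assumptions; a real client would use an authenticated cipher from a cryptographic library.

```python
import hashlib
import os
from typing import List, Tuple

KEY_LEN = 32


def toy_wrap(key: bytes, data: bytes) -> bytes:
    """Placeholder cipher: XOR with a SHA-256-derived pad. NOT secure;
    it stands in for encrypting a 32-byte key under an AES key."""
    pad = hashlib.sha256(b"toy-wrap" + key).digest()
    return bytes(a ^ b for a, b in zip(data, pad))


toy_unwrap = toy_wrap  # XOR with the same pad is its own inverse


def rotate_key(old_key: bytes) -> Tuple[bytes, bytes]:
    """Pick a new random key and return it with the old key wrapped under it,
    as a removal ModifyUserOp would carry."""
    new_key = os.urandom(KEY_LEN)
    return new_key, toy_wrap(new_key, old_key)


def recover_all_keys(current_key: bytes, wrapped_old_keys: List[bytes]) -> List[bytes]:
    """wrapped_old_keys[i] is key i encrypted under key i + 1 (oldest first).
    Returns every document key, oldest first."""
    keys = [current_key]
    for wrapped in reversed(wrapped_old_keys):
        keys.append(toy_unwrap(keys[-1], wrapped))
    keys.reverse()
    return keys
```

A user invited after two removals thus needs only the latest key from her invitation; the two earlier keys follow from the wrapped values stored in the history.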

SPORC’s model ensures proper access control over 
operations, based on how it tracks potential causality 
through prevSeqNo dependencies. Operations concurrent 
to a ModifyUserOp removal may be ordered before 
it and remain accessible to the user. However, once a 
client sees the removal meta-operation in its committed 
history any subsequent operation the client submits will 
be inaccessible to the removed user. 


4.2 Barrier Operations 


Concurrency also poses a challenge to membership man- 
agement. Consider the situation when two clients con- 
currently issue ModifyUserOps that both attempt to 
change the current symmetric key. If the server naively 
scheduled one after the other, then the continuous chain 
of old keys encrypted under new ones would be broken. 
To address situations like this, we introduce a primitive 
called a barrier operation. When the server receives an 
operation that is marked “barrier” and assigns it global 
sequence number $b$, the server requires that every
subsequent operation have a prevSeqNo $\geq b$. Subsequent
operations that do not are rejected and must be revised and
resubmitted with a later prevSeqNo. In this way, the server
can force all future operations to depend on the barrier
operation.⁴

³ In our current implementation, the size of a ModifyUserOp may
be linear in the number of users participating in the document, because
the operation may contain the current AES key encrypted under each
of the users’ RSA public keys. An optimization to achieve
constant-sized ModifyUserOps could instead use a space-efficient broadcast
encryption scheme [6].

Let us reconsider the example of two concurrent
ModifyUserOps, $op_1$ and $op_2$, that are marked as barriers.
Suppose that the server received $op_1$ first and assigned
it sequence number $b$. Since the operations were submitted
concurrently, $op_2$’s prevSeqNo will necessarily be less
than $b$, and $op_2$ will be rejected. The client attempting to
send $op_2$ must wait until it receives $op_1$, at which time it
will adjust $op_2$ to depend on this operation before
resubmitting (i.e., encrypt $op_1$’s key under its new key, and set
$op_2$’s prevSeqNo $\geq b$). As a result, the chain of old keys
encrypted under new ones will be preserved.
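
The server-side barrier rule is small enough to sketch directly. The following Python fragment is an illustration under assumed names (Server, submit, last_barrier); it models only sequencing and rejection, not storage, signatures, or broadcast to clients.

```python
from typing import List, Optional, Tuple


class Server:
    def __init__(self) -> None:
        self.next_seq = 1
        self.last_barrier = 0   # sequence number b of the latest barrier, if any
        self.log: List[Tuple[int, bytes]] = []

    def submit(self, prev_seq_no: int, is_barrier: bool, payload: bytes) -> Optional[int]:
        """Assign a global sequence number, or return None to reject an
        operation that does not depend on the latest barrier."""
        if prev_seq_no < self.last_barrier:
            return None         # client must revise and resubmit
        seq = self.next_seq
        self.next_seq += 1
        self.log.append((seq, payload))
        if is_barrier:
            self.last_barrier = seq
        return seq


server = Server()
for known in range(5):                                 # five ordinary operations
    server.submit(known, False, b"edit")

first = server.submit(5, True, b"op_1: rekey")         # accepted, becomes barrier b
second = server.submit(5, True, b"op_2: rekey")        # concurrent with op_1: rejected
retry = server.submit(first, True, b"op_2: revised")   # now depends on op_1
```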

Barrier operations have uses beyond membership man- 
agement. For example, as described next, they are useful 
in implementing checkpoints on the history. 


5 Extensions 


This section describes extensions to the basic SPORC 
protocols: supporting checkpoints to reduce the size re- 
quirements for storing the committed history (Section 5.1), 
detecting forks through out-of-band communication (Sec- 
tion 5.2), and recovering from forks by replaying and pos- 
sibly transforming forked operations (Section 5.3). Our 
current prototype does not yet implement these extensions, 
however. 


5.1 Checkpoints 


In order to reach a document’s latest state, a new client in 
our current implementation must download and apply the 
entire history of committed operations. It would be more 
efficient for a new client to instead download a check- 
point of operations—a compact representation of the doc- 
ument’s state, akin to each client’s local state—and then 
only apply individual committed operations since the last 
checkpoint. Much as SPORC servers cannot transform 
operations, they similarly cannot perform checkpoints; 
SPORC once again has individual clients play this role. 
To support checkpoints, each client maintains a com- 
pacted version of the committed history up to the most 
recent barrier operation. When a client is ready to upload 
a checkpoint to the server, it encrypts this compacted his- 
tory under the current document key. It then creates a new 
CheckpointOp meta-operation containing the hash of 
the encrypted checkpoint data and submits it into the his- 
tory. Requiring the checkpoint data to end in a barrier 
operation ensures that clients that later use the checkpoint 
will be able to ignore the history before the barrier without 
having to worry that they will need to perform OT
transformations involving that old history. After all, no operation
after a barrier can depend on an operation before it. If the
most recent barrier is too old, the client can submit a new
null barrier operation before creating the checkpoint.⁵

⁴ To prevent a malicious server from violating the rules governing
barrier operations, an operation’s “barrier” flag is covered by the
operation’s signature, and all clients verify that the server is handling barrier
operations correctly.

Checkpoints raise new security challenges, however. A 
client that lacks the full history cannot verify the hash 
chain all the way back to the document’s creation. It can 
verify that the operations it has chain together correctly, 
but the first operation in its history (i.e., the barrier op- 
eration) is “dangling,” and its prevHC value cannot be 
verified. This is not a problem if the client knows in 
advance that the CheckpointOp is part of the valid
history, but this is difficult to verify. The CheckpointOp
will be signed by a user, and users who have access to the 
document are assumed to be trusted; but there must be a 
way to verify that the signing user had permission to ac- 
cess the document at the time the checkpoint was created. 
Unfortunately, without access to a verifiable history of 
individual ModifyUserOps going back to the beginning
of the document, a client deciding whether to accept a 
checkpoint has no way to be certain of which users were 
actually members of the document at any given time. 

To address these issues, we propose that the server and 
clients maintain a meta-history, alongside the committed 
history, that is comprised solely of meta-operations. Meta- 
operations are included in the committed history as before, 
but each one also has a prevMetaSeqNo pointer to a prior 
element of the meta-history along with a corresponding 
prevMetaHC field. Each client maintains a separate hash 
chain over the meta-history and performs the same consis- 
tency checks on the meta-history that it performs on the 
committed history. 

When a client joins, before it downloads a check- 
point, it requests the entire meta-history from the server. 
The meta-history provides the client with a fork* con- 
sistent view of the sequence of ModifyUserOps and 
CheckpointOps that indicates whether the check- 
point’s creator was an authorized user when the checkpoint 
was created. Moreover, the cost of downloading the entire 
meta-history is likely to be low because meta-operations 
are rare relative to document operations. 


5.2 Checking for Forks Out-of-Band


Fork* consistency does not prevent a server from forking 
clients’ state, as long as the server never tells any member 
of one fork about any operation done by a member of an- 
other fork. To detect such forks, clients can exchange state 
information out-of-band, for example, by direct socket
connections, email, instant messaging, or posting on a
shared server or DHT service.

⁵ Having the checkpoint data end in an earlier barrier operation
is better than making CheckpointOps into barriers themselves. If
CheckpointOps were barriers, then either the client making the
checkpoint would have to “lock” the history to prevent new operations from
being admitted before the checkpoint was uploaded, or the system would
have to reject checkpoints that did not reflect the latest state, which could
potentially lead to livelock.

Clients can exchange messages of the form $\langle c, d, s, h_s \rangle$,
asserting that in client $c$’s view of document $d$, the hash
chain value as of sequence number $s$ is equal to $h_s$. On
receiving such a message, a client compares its own hash
chain value at sequence number $s$ with $h_s$, and if the
values differ, it knows a fork has occurred. If the recipient
does not yet have operations up to sequence number $s$, it
requests them from the server; a well-behaved server will
always be able to supply the missing operations.
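
A recipient’s handling of such a message can be sketched as follows. This Python fragment is illustrative only: the Assertion and Verdict types are assumptions, the local hash chain is modeled as a list indexed by sequence number, and signing, encryption, and the request to the server are left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


@dataclass(frozen=True)
class Assertion:
    client: str        # c: the asserting client
    doc: str           # d: the document identifier
    seq: int           # s: a global sequence number
    hash_value: bytes  # h_s: the sender's hash chain value at s


class Verdict(Enum):
    CONSISTENT = "consistent"
    FORK = "fork"
    NEED_OPS = "need_ops"       # fetch operations up to s, then re-check
    WRONG_DOC = "wrong_doc"


def check_assertion(doc: str, local_chain: List[bytes], msg: Assertion) -> Verdict:
    """local_chain[s] is this client's hash chain value at sequence number s."""
    if msg.doc != doc:
        return Verdict.WRONG_DOC
    if not 0 <= msg.seq < len(local_chain):
        return Verdict.NEED_OPS
    if local_chain[msg.seq] == msg.hash_value:
        return Verdict.CONSISTENT
    return Verdict.FORK
```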

These out-of-band messages should be digitally signed 
to prevent forgery. To prevent out-of-band messages from 
leaking information about which clients are collaborat- 
ing on a document, and to prevent a client from falsely 
claiming that it was invited into the document by a forked 
client, the out-of-band messages should be encrypted and 
MACed with a separate set of symmetric keys that are 
known only to nodes that have been part of the document.⁶
These keys might be conveyed in the first operation of the 
document’s history. 


5.3 Recovering from a Fork


A benefit of combining OT and fork* consistency is that 
we can use OT to recover from forks. OT is well suited 
to this task because, in normal operation, OT clients are 
essentially creating small forks whenever they optimisti- 
cally apply operations locally, and resolving these forks 
when they transform operations to restore consistency. In 
this section, we sketch an algorithm that a pair of forked 
clients can use to merge their divergent histories into a 
consistent whole. This pairwise algorithm can be repeated 
as necessary to resolve forks involving multiple clients, or 
multi-way forks. 

The basic idea of the algorithm is that the two clients 
will abandon the malicious server and agree on a new 
one. Both clients will roll back their histories to their last 
common point before the fork, and one of them will upload 
the common history, up to the fork point, to the new server. 
Finally, each client will resubmit the operations that it 
saw after the fork. OT will ensure that these resubmitted 
operations are merged safely so that both nodes end up in 
the same state. 

The situation becomes more complicated if the same 
operation appears in both histories. We cannot just remove 
the duplicate because later operations in the sequence may 
depend on it. Instead, we must cancel it out. To make 
this possible, we require that all operations be invertible:
we must be able to construct an inverse operation $op^{-1}$
such that applying $op$ followed by $op^{-1}$ results in a
no-op. This is often easy to do in practice by having each
operation store enough information about the prior state
to determine what the inverse should be. For example, a
delete operation can store the information that was deleted,
enabling the creation of an insert operation as the inverse.
To cancel each duplicate, we cannot simply splice its
inverse into the history right after it for the same reason
that we cannot just remove the duplicate. Instead, we
compute the inverse operation and then transform it past
all of the operations following the duplicate. This process
results in an operation that has the effect of canceling out
the duplicate when appended to the end of the sequence.

⁶ A client falsely claiming to have been invited into the document
in another fork will eventually be detected when the other clients try
to recover from the (false) fork. However, this is expensive, so we
would prefer to avoid it. By protecting the out-of-band messages with
symmetric keys known only to clients who have been in the document at
some point, we reduce the set of potential liars substantially.
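
The cancellation step can be illustrated with single-character insert and delete operations. The Python sketch below computes the inverse of a duplicate and transforms it past the operations that follow it. The Op type, the transformation rules, and the function names are assumptions made for this example rather than the algorithm as implemented in any SPORC application.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Op:
    kind: str   # "ins" or "del"
    pos: int
    ch: str     # a delete stores the removed character so it can be inverted


def invert(op: Op) -> Op:
    return Op("del" if op.kind == "ins" else "ins", op.pos, op.ch)


def apply_op(doc: str, op: Op) -> str:
    if op.kind == "ins":
        return doc[:op.pos] + op.ch + doc[op.pos:]
    if doc[op.pos] != op.ch:
        raise ValueError("delete does not match document")
    return doc[:op.pos] + doc[op.pos + 1:]


def transform(op: Op, against: Op) -> Optional[Op]:
    """Rewrite op to apply after `against`; None means op became a no-op."""
    if against.kind == "ins":
        shift = 1 if against.pos <= op.pos else 0
    else:
        if op.kind == "del" and against.pos == op.pos:
            return None   # the character is already gone
        shift = -1 if against.pos < op.pos else 0
    return replace(op, pos=op.pos + shift)


def cancel_duplicate(history: List[Op], dup_index: int) -> Optional[Op]:
    """Inverse of history[dup_index], transformed past every later operation,
    ready to be appended to the end of the history."""
    inverse: Optional[Op] = invert(history[dup_index])
    for later in history[dup_index + 1:]:
        if inverse is None:
            break
        inverse = transform(inverse, later)
    return inverse


history = [Op("ins", 0, "X"), Op("ins", 0, "Y"), Op("del", 2, "a")]
doc = "abc"
for op in history:
    doc = apply_op(doc, op)
canceling = cancel_duplicate(history, 0)
repaired = apply_op(doc, canceling)
```

In this example the first operation is treated as the duplicate. Appending the transformed inverse yields the same document as a history in which the duplicate never occurred.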


6 Implementation 


SPORC provides a framework for building collaborative 
applications that need to synchronize different kinds of 
state between clients. It consists of a generic server im- 
plementation and client-side libraries that implement the 
SPORC protocol, including the sending, receiving, en- 
cryption, and transformation of operations, as well as the 
necessary consistency checks and document membership
management. To build applications within the SPORC 
framework, a developer only needs to implement client- 
side functionality that (i) defines a data type for SPORC 
operations, (ii) defines how to transform a pair of opera- 
tions, and (iii) defines how to combine multiple document 
operations into a single one. The server need not be modi- 
fied, as it always deals with operations on encrypted data. 


6.1 Variants 


We implemented two variants of SPORC: a command-line 
version in which both client and server are stand-alone 
applications, and a web-based version with a browser- 
based client and a Java servlet. The command-line ver- 
sion, which we use for later microbenchmarks, is written 
in approximately 5500 lines of Java code (per SLOC- 
Count [46]) and, for network communication, uses the 
socket-based RPC library in the open-source release of 
Google Wave [16]. Because the server’s role is limited to 
ordering and storing client-supplied operations, its basic 
implementation is simple and only requires approximately 
300 lines of code. 

The web-based version shares the majority of its code 
with the command-line variant. The server just encap- 
sulates the command-line server functionality in a Java 
servlet. The client consists almost entirely of JavaScript 
code that was automatically generated using the Java-to- 
JavaScript compiler included with the Google Web Toolkit 
(GWT) [12]. Network communication uses a combina- 
tion of the GWT RPC framework, which wraps browser 
XmlHttpRequests, and the GWTEventService [37],
which allows the server to push messages to the browser
asynchronously through a long-lived HTTP connection
(the so-called “Comet” style of web programming). This
prototype could be extended with HTML5’s offline
storage to provide disconnected operation.

The client’s cryptographic module was its only
component that could not be translated to JavaScript. 
JavaScript remains too slow to implement public key cryp- 
tography efficiently, and browsers lack both secure storage 
for cryptographic keys and a secure pseudorandom num- 
ber generator for key generation. To work around these 
limitations, we encapsulate our cryptographic module in a 
Java applet and implement JavaScript-to-Java communica- 
tion using the LiveConnect API [28] (a strategy employed 
in [2, 47]). Our experience suggests it would be beneficial 
for browsers to provide a JavaScript API that supported 
basic cryptographic primitives. 


6.2 Building SPORC Applications 


To demonstrate the usefulness of our framework, we 
built two prototype applications: a causally-consistent 
key-value store and a web-based collaborative text editor. 
The key-value store keeps a simple dictionary—mapping 
strings to strings—synchronized across a set of partici- 
pating clients. To implement it, we defined a data type 
that represents a list of keys to update or remove. We 
wrote a simple transformation function that implements a 
“last writer wins” policy, as well as a composition function 
that merges two lists of key updates in a straightforward 
manner. Overall, the application-specific portion of the 
key-value store only required 280 lines of code. 
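
The paper does not list the key-value store’s code, so the following Python sketch is only one plausible reading of a “last writer wins” transformation and a composition function. The representation of an operation as a dictionary from keys to new values (with None meaning removal) and the names transform_pair and compose are assumptions.

```python
from typing import Dict, Optional, Tuple

# An operation maps each affected key to its new value; None removes the key.
KVOp = Dict[str, Optional[str]]


def apply_op(state: Dict[str, str], op: KVOp) -> Dict[str, str]:
    result = dict(state)
    for key, value in op.items():
        if value is None:
            result.pop(key, None)
        else:
            result[key] = value
    return result


def transform_pair(earlier: KVOp, later: KVOp) -> Tuple[KVOp, KVOp]:
    """Both operations were generated on the same state, and `later` is ordered
    second. Returns (earlier', later'): earlier' applies after `later` and
    later' applies after `earlier`. The later writer wins on shared keys, so
    the transformation is a no-op when the key sets are disjoint."""
    earlier_after = {k: v for k, v in earlier.items() if k not in later}
    return earlier_after, dict(later)


def compose(first: KVOp, second: KVOp) -> KVOp:
    """Single operation equivalent to applying `first` and then `second`."""
    return {**first, **second}
```

Under these definitions, applying `earlier` followed by the transformed `later` gives the same dictionary as applying `later` followed by the transformed `earlier`, which is the convergence property that OT requires.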

The collaborative editor allows multiple users to modify 
a text document simultaneously via their web browsers 
and see each other’s changes in near real-time. It pro- 
vides a user experience similar to Google Docs [14] and 
EtherPad [13], but, unlike those services, it does not re- 
quire a trusted server. To implement it, we were able to 
reuse the data types and the transformation and compo- 
sition functions from the open-source release of Google 
Wave. Although Wave is a server-centric OT system with- 
out SPORC’s level of security and privacy, we were able 
to adapt its components for our framework with only 550 
lines of wrapper code. 


7 Experimental Evaluation 


The user-facing collaborative applications for which 
SPORC was designed—e.g., word processing, calendar- 
ing, and instant messaging—require latency that is low 
enough for human users to see each other’s updates in
real-time. But unlike file or storage systems, their primary 
goal is not high throughput. In this section, we present 
the results of several microbenchmarks of our Java-based
command-line version, to demonstrate SPORC’s usefulness
for this class of applications.

Figure 2: Latency of SPORC with a single client writer


We performed our experiments on a cluster of five commodity
machines, each with eight 2.3 GHz AMD Opteron
cores and 8 GB of RAM, that were connected by gigabit 
switched Ethernet. In each of our experiments, we ran a 
single server instance on its own machine, along with vary- 
ing numbers of client instances. To scale our system to 
moderate numbers of clients, in many of our experiments, 
we ran multiple client instances on each machine. We ran 
all the experiments under the OpenJDK Java VM (version 
IcedTea6 1.6). For RSA signatures, however, we used 
the Network Security Services for Java (JSS) library from 
the Mozilla Project [29] because, unlike Java’s default 
cryptography library, it is implemented in native code and 
offers considerably better performance.

Latency. To measure SPORC’s latency, we conducted
three minute runs with between one and sixteen clients for 
both key-value and text editor operations. We tested our 
system under both low-load conditions, where only one of 
the clients submitted new operations (once every 200 ms), 
and high-load conditions, where all of the clients were 
writers. We measured latency by computing the mean 
time that an operation was “in flight”: from the time that it 
was generated by the sender’s application-level code, until 
the time it was delivered to the recipient’s application.

Figure 3: Latency of SPORC with all clients issuing writes


Under low-load conditions with only one writer, we 
would expect the load on each client to remain constant as 
the number of clients increases, because each additional 
client does not add to the total number of operations in 
flight. We would, however, expect to see server latency 
increase modestly, as the server has to send operations 
to increasing numbers of clients. Indeed, as shown in 
Figure 2, the latency due to server processing increased 
from under 1 ms with one client to over 3 ms with sixteen
clients, while overall latency increased modestly from
approximately 19 ms to approximately 25 ms.⁷

On the other hand, when every client is a writer, we 
would expect the load on each client to increase with the 
number of clients. As expected, Figure 3 shows that with 
sixteen clients under loaded conditions, overall latency 
is higher: approximately 26 ms for key-value operations 
and 33 ms for the more expensive text-editor operations. 
The biggest contributor to this increase is client queue- 
ing, which is primarily the time that a client’s received 
operations spend in its incoming queue before being
processed. Queueing delay begins at around 3 ms for one
client and then increases steadily until it levels off at
approximately 8 ms for the key-value application and 14 ms
for the text editor. Despite this increase, Figure 3
demonstrates that SPORC successfully supports real-time
collaboration for moderately-sized groups, even under load.
As these experiments were performed on a local-area
network, a wide-area deployment of SPORC would see an
increase in latency that reflects the correspondingly higher
network round-trip time.

⁷ Figure 2 also shows small increases in the latency of client
processing and queueing when the number of clients was greater than four.
These increases are most likely due to the fact that, when we conducted
experiments with more than four clients, we ran multiple client instances
per machine.

Figure 4: Server throughput as a function of payload size.


Figures 2 and 3 also show that client-side cryptographic 
operations account for a large share of overall latency. 
This occurs because SPORC performs a 2048-bit RSA sig- 
nature on every outgoing operation and because Mozilla 
JSS, while better than Java’s built-in cryptography library,
still requires about 10 ms to compute a single signature. 
Using an optimized implementation of a more efficient sig- 
nature scheme, such as ESIGN, could improve the latency 
of signatures by nearly two orders of magnitude [24]. 


Server throughput. We measured the server’s maximum 
throughput by saturating the server with operations using 
100 clients. These particular clients were modified to 
allow them to have more than one operation in flight at 
a time. Figure 4 shows server throughput as a function 
of payload size, measured in terms of both operations 
per second and MB per second. Each data point was 
computed by performing a three minute run of the system 
and then taking the median of the mean throughput of 
each one second interval. The error bars represent the 5th 
and 95th percentiles. The figure shows that, as expected, 
when payload size increases, the number of operations per 
second decreases, because each operation requires more 
time to process. But, at the same time, data throughput 
(MB/s) increases, because the processing overhead per 
byte decreases. 


Client time-to-join. Because our current implementa- 
tion lacks the checkpoints of Section 5.1, when a client 
joins the document, it must first download each individ- 
ual operation in the committed history. To evaluate the 
cost of joining an existing document, we first filled the 
history with varying numbers of operations. Then, we
measured the time it took for a new client to receive the
shared decryption key and download and process all of
the committed operations. We performed two kinds of
experiments: one where the client started with an empty
local state, and a second in which the client had 2000
pending operations that had yet to be submitted to the
server. The purpose of the second test was to measure
how long it would take for a client that had been working
offline for some length of time to synchronize with
the current state of the document. Synchronization
requires the client to transform its pending operations past
the committed operations that the client has not seen; thus,
it is more costly than joining a document with an empty
local state. Notably, since the fork recovery algorithm
sketched in Section 5.3 relies on the same mechanism that
is used to synchronize clients that have been offline (it
treats operations after the fork as if they were pending
uncommitted operations), this test also sheds light on the
cost of recovering from a fork.

Figure 5: Client time-to-join given a variable length history

Figure 5 shows time-to-join as a function of history size. 
Each data point represents the median of ten runs, and the 
error bars correspond to the 10th and 90th percentiles. We 
find that time-to-join is linear in the number of committed 
operations. It takes a client with an empty local state 
approximately one additional second to join a document 
for every additional 1000 committed operations. 

In addition, the figure shows that the time-to-join with 
a significant number of pending operations varies greatly 
by application. In the key-value application, the transfor- 
mation function is cheap, because it is effectively a no-op 
if the given operations do not affect the same keys. As 
a result, the cost of transforming 2000 operations adds 
little to the time-to-join. By contrast, the text editor’s 
more complex transformation function adds a non-trivial, 
although still acceptable, amount of overhead. 


8 Related Work 


Real-time “groupware” collaboration systems have 
adapted classic distributed systems techniques for time- 
stamping and ordering (e.g., [4, 5, 20]), but have also 
introduced novel techniques to automatically resolve
conflicts between concurrent operations in an
intention-preserving manner (e.g., [11, 18, 33, 38, 39, 40, 41, 42]).
These techniques form the basis of SPORC’s client syn- 
chronization mechanism and allow it to support slow or 
disconnected networks. Several systems also use OT to im- 
plement undo functionality (e.g., [32, 33]), and SPORC’s 
fork recovery algorithm draws upon these approaches. 
Furthermore, as an alternative to OT, Bayou [43] allows 
applications to specify conflict detection and merge pro- 
tocols to reconcile concurrent operations. Most of these 
protocols focus on decentralized settings and use n-way 
reconciliation, but several well-known systems use a cen- 
tral server to simplify synchronization between clients 
(including Jupiter [30] and Google Wave [44]). SPORC 
also uses a central server for ordering and storage, but al- 
lows the server to be untrusted. Secure Spread [3] presents 
several efficient message encryption and key distribution 
architectures for such client-server group collaboration 
settings. But unlike SPORC, it relies on trusted servers 
that can generate keys and re-encrypt messages as needed. 

Traditionally, distributed systems have defended against 
potentially malicious servers by replicating functional- 
ity and storage over multiple servers. Protocols, such 
as Byzantine fault tolerant (BFT) replicated state ma- 
chines [9, 21, 48] or quorum systems [1, 26], can then 
guarantee safety and liveness, provided that some fraction 
of these servers remain non-faulty. Modern approaches 
optimize performance by, for example, concurrently exe- 
cuting independent operations [19], permitting client-side 
speculation [45], or supporting eventual consistency [35]. 
BFT protocols face criticism, however, because when the 
number of correct servers falls below a certain threshold 
(typically two-thirds), they cannot make progress. 

Subsequently, variants of fork consistency protocols 
(e.g., [7, 27, 31]) have addressed the question of how 
much safety one can achieve with a single untrusted server. 
These works demonstrate that server equivocation can al- 
ways be detected unless the server permanently forks the 
clients into groups that cannot communicate with each 
other. SUNDR [24] and FAUST [8] use these fork consis- 
tency techniques to implement storage protocols on top 
of untrusted servers. Other systems, such as A2M [10] 
and TrInc [22], rely on trusted hardware to detect server 
equivocation. BFT2F [23] combines techniques from 
BFT replication and SUNDR to achieve fork* consistency 
with higher fractions of faulty nodes than BFT can resist. 
SPORC borrows from the design of BFT2F in its use of 
hash chains to limit equivocation, but unlike BFT2F or 
any of these other systems, SPORC allows disconnected 
operation and enables clients to recover from server equiv- 
ocation, not just detect it. 

Like SPORC, two very recent systems, Venus [34] and 
Depot [25], allow clients to use a cloud resource without 
having to trust it, and they also support some degree of 


disconnected operation. Venus provides strong consis- 
tency in the face of a potentially malicious server, but 
does not support applications other than key-value storage. 
Furthermore, unlike SPORC, it requires the majority of 
a “core set” of clients to be online in order to achieve 
most of its consistency guarantees. In addition, although 
members may be added dynamically to the group editing 
the shared state, it does not allow access to be revoked, 
nor does it provide a mechanism for distributing encryp- 
tion keys. Depot, on the other hand, does not rely on 
the availability of a “core set” of clients and supports var- 
ied applications. Moreover, similar to SPORC, it allows 
clients to recover from malicious forks using the same 
mechanism that it uses to keep clients synchronized. But 
rather than providing a means for reconciling conflicting 
operations as SPORC does with OT, Depot relies on the 
application for conflict resolution. Because Depot treats 
clients and servers identically, it can also tolerate faulty 
clients, in addition to faulty servers. Unlike SPORC, how- 
ever, Depot does not consider dynamic access control or 
confidentiality. 


9 Conclusion 


Our original goal for SPORC was to design a general 
framework for web-based group collaboration that could 
leverage cloud resources, but not be beholden to them 
for privacy guarantees. This goal leads to a design in 
which servers only store encrypted data, and each client 
maintains its own local copy of the shared state. But when 
each client has its own copy of the state, the system must 
keep them synchronized, and operational transformation 
provides a way to do so. OT enables optimistic updates
and automatically reconciles clients’ conflicting states. 

Supporting applications that need timely commits re- 
quires a central server. But if we do not trust the server 
to preserve data privacy, we should not trust it to commit 
operations correctly either. This requirement led us to 
employ fork* consistency techniques to allow clients to 
detect server equivocation about the order of committed 
operations. But beyond the benefits that each provides 
independently, this work shows that OT and fork* consis- 
tency complement each other well. Whereas prior systems 
that enforced fork* consistency alone were only able to 
detect malicious forks, by combining fork* consistency 
with OT, SPORC can recover from them using the same 
mechanism that keeps clients synchronized. 

In addition to these conceptual contributions, we present 
a membership management architecture that provides dy- 
namic access control and key distribution with an un- 
trusted server, even in the face of concurrency. Finally, 
we also demonstrate the flexibility of our design by imple- 
menting two applications: a causally-consistent key-value 
store and a browser-based collaborative text editor. 
Abstract


Computer networks lack a general control paradigm, 
as traditional networks do not provide any network- 
wide management abstractions. As a result, each new 
function (such as routing) must provide its own state 
distribution, element discovery, and failure recovery 
mechanisms. We believe this lack of a common control 
platform has significantly hindered the development of 
flexible, reliable and feature-rich network control planes. 

To address this, we present Onix, a platform on top of 
which a network control plane can be implemented as a 
distributed system. Control planes written within Onix 
operate on a global view of the network, and use basic 
state distribution primitives provided by the platform. 
Thus Onix provides a general API for control plane 
implementations, while allowing them to make their own 
trade-offs among consistency, durability, and scalability. 


1 Introduction 


Network technology has improved dramatically over 
the years with line speeds, port densities, and perfor- 
mance/price ratios all increasing rapidly. However, 
network control plane mechanisms have advanced at a 
much slower pace; for example, it takes several years 
to fully design, and even longer to widely deploy, a 
new network control protocol.¹ In recent years, as
new control requirements have arisen (e.g., greater scale, 
increased security, migration of VMs), the inadequacies 
of our current network control mechanisms have become 
especially problematic. In response, there is a growing 
movement, driven by both industry and academia, towards 
a control paradigm in which the control plane is decoupled 
from the forwarding plane and built as a distributed 
system.²

In this model, a network-wide control platform, run- 
ning on one or more servers in the network, oversees a 
set of simple switches. The control platform handles state 
distribution — collecting information from the switches 


*Nicira Networks 

†Google

‡NEC

§International Computer Science Institute (ICSI) & UC Berkeley

¹See, for example, TRILL [32], a recent success story which has
been in the design and specification phase for over 6 years. 

²The industrial efforts in this area are typically being undertaken by
entities that operate large networks, not by the incumbent networking 
equipment vendors themselves. 


and distributing the appropriate control state to them, as 
well as coordinating the state among the various platform 
servers — and provides a programmatic interface upon 
which developers can build a wide variety of management 
applications. (The term “management application” refers 
to the control logic needed to implement management 
features such as routing and access control.)³ For the
purposes of this paper, we refer to this paradigm for 
network control as Software-Defined Networking (SDN). 

This is in contrast to the traditional network control 
model in which state distribution is limited to link and 
reachability information and the distribution model is 
fixed. Today a new network control function (e.g., 
scalable routing of flat intra-domain addresses [21]) 
requires its own distributed protocol, which involves first 
solving a hard, low-level design problem and then later 
overcoming the difficulty of deploying this design on 
switches. As a result, networking gear today supports 
a baroque collection of control protocols with differing 
scalability and convergence properties. On the other hand, 
with SDN, a new control function requires writing control 
logic on top of the control platform’s higher-level API; the 
difficulties of implementing the distribution mechanisms 
and deploying them on switches are taken care of by the 
platform. Thus, not only is the work to implement a 
new control function reduced, but the platform provides 
a unified framework for understanding the scaling and 
performance properties of the system. 

Said another way, the essence of the SDN philosophy 
is that basic primitives for state distribution should be 
implemented once in the control platform rather than 
separately for individual control tasks, and should use 
well-known and general-purpose techniques from the dis- 
tributed systems literature rather than the more specialized 
algorithms found in routing protocols and other network 
control mechanisms. The SDN paradigm allows network 
system implementors to use a single control platform 
to implement a range of control functions (e.g., routing, 
traffic engineering, access control, VM migration) over a 
spectrum of control granularities (from individual flows 
to large traffic aggregates) in a variety of contexts (e.g., 
enterprises, datacenters, WANs). 


³Just to be clear, we only imagine a single “application” being used
in any particular deployment; this application might address several
issues, such as routing and access control, but the control platform
is not designed to allow multiple applications to control the network
simultaneously (unless the network is “physically sliced” [28]).
Because the control platform simplifies the duties of
both switches (which are controlled by the platform) and 
the control logic (which is implemented on top of the 
platform) while allowing great generality of function, 
the control platform is the crucial enabler of the SDN 
paradigm. The most important challenges in building a 
production-quality control platform are: 


• Generality: The control platform’s API must allow
management applications to deliver a wide range of 
functionality in a variety of contexts. 


• Scalability: Because networks (particularly in the
datacenter) are growing rapidly, any scaling limita- 
tions should be due to the inherent problems of state 
management, not the implementation of the control 
platform. 


• Reliability: The control platform must handle equipment
(and other) failures gracefully.


• Simplicity: The control platform should simplify the
task of building management applications. 


• Control plane performance: The control platform
should not introduce significant additional control 
plane latencies or otherwise impede management 
applications (note that forwarding path latencies 
are unaffected by SDN). However, the requirement 
here is for adequate control-plane performance, not 
optimal performance. When faced with a tradeoff 
between generality and control plane performance, 
we try to optimize the former while satisficing the 
latter.⁴


While a number of systems following the basic 
paradigm of SDN have been proposed, to date there has 
been little published work on how to build a network 
control platform satisfying all of these requirements. 
To fill this void, in this paper we describe the design 
and implementation of such a control platform called 
Onix (Sections 2-5). While we do not yet have extensive 
deployment experience with Onix, we have implemented 
several management applications which are undergoing 
production beta trials for commercial deployment. We 
discuss these and other use cases in Section 6, and present 
some performance measures of the platform itself in 
Section 7. 

Onix did not arise de novo, but instead derives from 
a long history of related work, most notably the line 


⁴There might be settings where optimizing control plane
performance is crucial. For example, if one cannot use backup paths for 
improved reliability, one can only rely on a fine-tuned routing protocol. 
In such settings one might not use a general-purpose control platform, 
but instead adopt a more specialized approach. We consider such settings 
increasingly uncommon. 
Figure 1: There are four components in an Onix controlled
network: managed physical infrastructure, connectivity 
infrastructure, Onix, and the control logic implemented by the 
management application. This figure depicts two Onix instances 
coordinating and sharing (via the dashed arrow) their views of 
the underlying network state, and offering the control logic a 
read/write interface to that state. Section 2.2 describes the NIB. 


of research that started with the 4D project [15] and 
continued with RCP [3], SANE [6], Ethane [5] and 
NOX [16] (see [4,23] for other related work). While all of 
these were steps towards shielding protocol design from 
low-level details, only NOX could be considered a control 
platform offering a general-purpose API.⁵ However, NOX
did not adequately address reliability, nor did it give 
the application designer enough flexibility to achieve 
scalability. 

The primary contributions of Onix over existing work 
are thus twofold. First, Onix exposes a far more general 
API than previous systems. As we describe in Section 6, 
projects being built on Onix are targeting environments 
as diverse as the WAN, the public cloud, and the 
enterprise data center. Second, Onix provides flexible 
distribution primitives (such as DHT storage and group 
membership) allowing application designers to implement 
control applications without re-inventing distribution 
mechanisms, and while retaining the flexibility to make 
performance/scalability trade-offs as dictated by the 
application requirements. 


2 Design 


Understanding how Onix realizes a production-quality 
control platform requires discussing two aspects of its 
design: the context in which it fits into the network, and 
the API it provides to application designers. 


2.1 Components 


There are four components in a network controlled by 
Onix, and they have very distinct roles (see Figure 1). 


• Physical infrastructure: This includes network
switches and routers, as well as any other network 
elements (such as load balancers) that support 
an interface allowing Onix to read and write the 


⁵Only a brief sketch of NOX has been published; in some ways,
this paper can be considered the first in-depth discussion of a NOX-like 
design, albeit in a second-generation form. 
state controlling the element’s behavior (such as
forwarding table entries). These network elements 
need not run any software other than that required 
to support this interface and (as described in the 
following bullet) achieve basic connectivity. 


• Connectivity infrastructure: The communication
between the physical networking gear and Onix (the
“control traffic”) transits the connectivity infrastructure.
This control channel may be implemented
either in-band (in which the control traffic shares 
the same forwarding elements as the data traffic on 
the network), or out-of-band (in which a separate 
physical network is used to handle the control 
traffic). The connectivity infrastructure must sup- 
port bidirectional communication between the Onix 
instances and the switches, and optionally supports 
convergence on link failure. Standard routing 
protocols (such as IS-IS or OSPF) are suitable for 
building and maintaining forwarding state in the 
connectivity infrastructure. 


• Onix: Onix is a distributed system which runs on
a cluster of one or more physical servers, each of 
which may run multiple Onix instances. As the 
control platform, Onix is responsible for giving 
the control logic programmatic access to the net- 
work (both reading and writing network state). In 
order to scale to very large networks (millions of 
ports) and to provide the requisite resilience for 
production deployments, an Onix instance is also 
responsible for disseminating network state to other 
instances within the cluster. 


• Control logic: The network control logic is implemented
on top of Onix’s API. This control logic
determines the desired network behavior; Onix 
merely provides the primitives needed to access the 
appropriate network state. 


These are the four basic components of an SDN- 
based network. Before delving into the design of Onix, 
we should clarify our intended range of applicability. 
We assume that the physical infrastructure can forward 
packets much faster (typically by two or more orders of 
magnitude) than Onix (or any general control platform) 
can process them; thus, we do not envision using Onix to 
implement management functions that require the control 
logic to know about per-packet (or other rapid) changes 
in network state. 


2.2 The Onix API


The principal contribution of Onix is defining a useful and 
general API for network control that allows for the de- 
velopment of scalable applications. Building on previous 
work [16], we designed Onix’s API around a view of the 


physical network, allowing control applications to read 
and write state to any element in the network. Our API 
is therefore data-centric, providing methods for keeping 
state consistent between the in-network elements and the 
control application (running on multiple Onix instances). 

More specifically, Onix’s API consists of a data model 
that represents the network infrastructure, with each 
network element corresponding to one or more data 
objects. The control logic can: read the current state 
associated with that object; alter the network state by 
operating on these objects; and register for notifications 
of state changes to these objects. In addition, since 
Onix must support a wide range of control scenarios, 
the platform allows the control logic to customize (in a 
way we describe later) the data model and have control 
over the placement and consistency of each component 
of the network state. 

The copy of the network state tracked by Onix is stored 
in a data structure we call the Network Information Base 
(NIB), which we view as roughly analogous to the Rout- 
ing Information Base (RIB) used by IP routers. However, 
rather than just storing prefixes to destinations, the NIB is 
a graph of all network entities within a network topology. 
The NIB is both the heart of the Onix control model and 
the basis for Onix’s distribution model. Network control 
applications are implemented by reading and writing 
to the NIB (for example modifying forwarding state or 
accessing port counters), and Onix provides scalability 
and resilience by replicating and distributing the NIB 
between multiple running instances (as configured by the 
application). 

While Onix handles the replication and distribution of 
NIB data, it relies on application-specific logic to both 
detect and resolve conflicts in network state as it
is exchanged between Onix instances, as well as between 
an Onix instance and a network element. The control 
logic may also dictate the consistency guarantees for state 
disseminated between Onix instances using distributed 
locking and consensus algorithms. 

In order to simplify the discussion, we assume that 
the NIB only contains physical entities in the network. 
However, in practice it can easily be extended to support 
logical elements (such as tunnels). 


2.3 Network Information Base Details


At its most generic level, the NIB holds a collection of 
network entities, each of which holds a set of key-value 
pairs and is identified by a flat, 128-bit, global identifier. 
These network entities are the base structure from which 
all types are derived. Onix supports stronger typing 
through typed entities, representing different network 
elements (or their subparts). Typed entities then contain 
a predefined set of attributes (using the key-value pairs) 
and methods to perform operations over those attributes. 
Figure 2: The default network entity classes provided by
Onix’s API. Solid lines represent inheritance, while dashed lines 
correspond to referential relation between entity instances. The 
numbers on the dashed lines show the quantitative mapping 
relationship (e.g., one Link maps to two Ports, and two 
Ports can map to the same Link). Nodes, ports and links 
constitute the network topology. All entity classes inherit the 
same base class providing generic key-value pair access. 


For example, there is a Port entity class that can 
belong to a list of ports in a Node entity. Figure 2 
illustrates the default set of typed entities Onix provides — 
all typed entities have a common base class limited to 
generic key-value pair access. The type-set within Onix is 
not fixed and applications can subclass these basic classes 
to extend Onix’s data model as needed.⁶
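
The entity model described above can be sketched as follows. This is a minimal illustration, not Onix's actual API: the class and method names (`Entity`, `get`, `set`, `add_port`) and the `MeteredPort` subclass are hypothetical, and only the structure (a 128-bit identifier, generic key-value access in a common base class, and typed subclasses) comes from the text.

```python
import uuid
from typing import Any, Dict, List, Optional


class Entity:
    """Base class: a flat 128-bit identifier plus generic key-value pairs."""

    def __init__(self, entity_id: Optional[int] = None) -> None:
        # uuid4().int is a random 128-bit integer.
        self.id = uuid.uuid4().int if entity_id is None else entity_id
        self._attrs: Dict[str, Any] = {}

    def get(self, key: str, default: Any = None) -> Any:
        return self._attrs.get(key, default)

    def set(self, key: str, value: Any) -> None:
        self._attrs[key] = value


class Port(Entity):
    """Typed entity: predefined attributes layered on the key-value pairs."""

    @property
    def name(self) -> Optional[str]:
        return self.get("name")


class Node(Entity):
    """Typed entity holding references (identifiers) to its ports."""

    @property
    def ports(self) -> List[int]:
        return list(self.get("ports", []))

    def add_port(self, port: Port) -> None:
        self.set("ports", self.ports + [port.id])


class MeteredPort(Port):
    """Hypothetical application-defined subclass extending the data model."""

    def record_bytes(self, count: int) -> None:
        self.set("billed_bytes", self.get("billed_bytes", 0) + count)
```

Because every typed attribute is stored through the base class's key-value pairs, a generic component that knows nothing about `MeteredPort` can still read and replicate its state.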

The NIB provides multiple methods for the control 
logic to gain access to network entities. It maintains an 
index of all of its entities based on the entity identifier, 
allowing for direct querying of a specific entity. It also 
supports registration for notifications on state changes 
or the addition/deletion of an entity. Applications can 
further extend the querying capabilities by listening for 
notifications of entity arrivals and maintaining their own 
indices. 

The control logic for a typical application is therefore 
fairly straightforward. It will register to be notified on 
some state change (e.g., the addition of new switches and 
ports), and once the notification fires, it will manipulate 
the network state by modifying the key-value pairs of the 
affected entities. 
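
The pattern of a typical application can be sketched as follows. The `NIB` class here is a hypothetical in-memory stand-in with synchronous notifications (the real NIB is asynchronous and distributed), and names such as `register`, `add`, `update`, and `install_default_policy` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List


class NIB:
    """Toy NIB: entities indexed by identifier, with change notifications."""

    def __init__(self) -> None:
        self._entities: Dict[int, Dict[str, Any]] = {}
        self._listeners: List[Callable[[str, int], None]] = []

    def register(self, callback: Callable[[str, int], None]) -> None:
        self._listeners.append(callback)

    def get(self, entity_id: int) -> Dict[str, Any]:
        return self._entities[entity_id]

    def add(self, entity_id: int, kind: str, **attrs: Any) -> None:
        self._entities[entity_id] = {"kind": kind, **attrs}
        self._notify("added", entity_id)

    def update(self, entity_id: int, key: str, value: Any) -> None:
        self._entities[entity_id][key] = value
        self._notify("changed", entity_id)

    def _notify(self, event: str, entity_id: int) -> None:
        for callback in list(self._listeners):
            callback(event, entity_id)


def install_default_policy(nib: NIB, index: Dict[str, List[int]]) -> None:
    """Control logic: react to new entities and keep an application-side index."""

    def on_event(event: str, entity_id: int) -> None:
        if event != "added":
            return
        entity = nib.get(entity_id)
        index[entity["kind"]].append(entity_id)  # application-maintained index
        if entity["kind"] == "switch":
            nib.update(entity_id, "default_action", "drop")

    nib.register(on_event)
```

The `index` dictionary shows how an application can extend the NIB's identifier-based lookup with its own queries, here a lookup by entity kind.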

The NIB provides neither fine-grained nor distributed 
locking mechanisms, but rather a mechanism to request 
and release exclusive access to the NIB data structure 
of the local instance. While the application is given the 
guarantee that no other thread is updating the NIB within 
the same controller instance, it is not guaranteed the 
state (or related state) remains untouched by other Onix 
instances or network elements. For such coordination, 
it must use mechanisms implemented externally to the 
NIB. We describe this in more detail in Section 4; for now, 
we assume this coordination is mostly static and requires 
control logic involvement during failure conditions. 
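
A minimal sketch of instance-local exclusive access is shown below, assuming a single coarse lock per Onix instance. `LocalNIB`, `exclusive`, and `bump` are hypothetical names; the sketch deliberately provides no protection against updates from other instances or network elements, mirroring the limitation described above.

```python
import threading
from contextlib import contextmanager
from typing import Any, Dict, Iterator


class LocalNIB:
    """Toy NIB guarded by one coarse, instance-local lock."""

    def __init__(self) -> None:
        self._lock = threading.RLock()
        self.entities: Dict[str, Any] = {}

    @contextmanager
    def exclusive(self) -> Iterator[Dict[str, Any]]:
        # Request exclusive access on entry, release it on exit.
        with self._lock:
            yield self.entities


def bump(nib: LocalNIB, times: int) -> None:
    """Read-modify-write that is only safe while holding exclusive access."""
    for _ in range(times):
        with nib.exclusive() as entities:
            entities["counter"] = entities.get("counter", 0) + 1
```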

All NIB operations are asynchronous, meaning that 
updating a network entity only guarantees that the update 
message will eventually be sent to the corresponding 


⁶Subclassing also enables control over how the key-value pairs are
stored within the entity. Control logics may prefer different trade-offs 
between memory and CPU usage. 
Category           Purpose
Query              Find entities.
Create, destroy    Create and remove entities.
Access attributes  Inspect and modify entities.
Notifications      Receive updates about changes.
Synchronize        Wait for updates being exported to
                   network elements and controllers.
Configuration      Configure how state is imported
                   to and exported from the NIB.
Pull               Ask for entities to be imported
                   on-demand.

Table 1: Functions provided by the Onix NIB API.


network element and/or other Onix instances — no 
ordering or latency guarantees are given. While this 
has the potential to simplify the control logic and make 
multiple modifications more efficient, often it is useful to 
know when an update has successfully completed. For 
instance, to minimize disruption to network traffic, the 
application may require the updating of forwarding state 
on multiple switches to happen in a particular order (to 
minimize, for example, packet drops). For this purpose, 
the API provides a synchronization primitive: if called 
for an entity, the control logic will receive a callback once 
the state has been pushed. After receiving the callback, 
the control logic may then inspect the contents of the NIB 
and verify that the state is as expected before proceeding. 
We note that if the control logic implements distributed 
coordination, race-conditions in state updates will either 
not exist or will be transient in nature. 
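
The use of the synchronization primitive to order updates can be sketched as follows. `AsyncNIB` is a hypothetical stand-in in which `flush` plays the role of the platform's background export, and the choice to program the last hop first is an assumption made for illustration, not a rule stated by Onix.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

Update = Tuple[str, str, str]  # (entity, key, value)


class AsyncNIB:
    """Toy NIB whose updates are queued and exported later."""

    def __init__(self) -> None:
        self._pending: Deque[Update] = deque()
        self._waiters: Dict[str, List[Callable[[str], None]]] = {}
        self.pushed: List[Update] = []  # order in which updates were exported

    def update(self, entity: str, key: str, value: str) -> None:
        self._pending.append((entity, key, value))  # returns immediately

    def synchronize(self, entity: str, callback: Callable[[str], None]) -> None:
        self._waiters.setdefault(entity, []).append(callback)

    def flush(self) -> None:
        """Export queued updates; fire callbacks once an entity is fully pushed."""
        while self._pending:
            entity, key, value = self._pending.popleft()
            self.pushed.append((entity, key, value))
            if not any(e == entity for e, _, _ in self._pending):
                for callback in self._waiters.pop(entity, []):
                    callback(entity)


def install_path_in_order(nib: AsyncNIB, switches: List[str], rule: str) -> None:
    """Program one switch at a time, waiting for each push to complete."""
    remaining = list(reversed(switches))  # assumed order: last hop first

    def program_next(_entity: str = "") -> None:
        if remaining:
            switch = remaining.pop(0)
            nib.update(switch, "flow", rule)
            nib.synchronize(switch, program_next)

    program_next()
```

Each switch is updated only after the callback for the previous one fires, so the order of exported updates is fixed even though individual `update` calls carry no ordering guarantee.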

An application may also rely solely on NIB notifications
to react to failed modifications, as it would to any
other network state change. Table 1 lists the available
NIB-manipulation methods.


3 Scaling and Reliability 


To be a viable alternative to the traditional network 
architecture, Onix must meet the scalability and reliability 
requirements of today’s (and tomorrow’s) production net- 
works. Because the NIB is the focal point for the system 
state and events, its use largely dictates the scalability and 
reliability properties of the system. For example, as the 
number of elements in the network increases, a NIB that 
is not distributed could exhaust system memory. Or, the 
number of network events (generated by the NIB) or work 
required to manage them could grow to saturate the CPU 
of a single Onix instance.⁷

This and the following section describe the NIB 
distribution framework that enables Onix to scale to very 


⁷In one of our upcoming deployments, if a single-instance
application took one second to analyze the statistics of a single Port 
and compute a result (e.g., for billing purposes), that application would 
take two months to process all Ports in the NIB. 
large networks, and to handle network and controller
failure. 
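
As a rough check of the single-instance CPU example in the footnote, the arithmetic can be worked through as follows. The 60-day reading of "two months" and the assumption that the work divides evenly across instances are ours; the resulting port count is an inference, not a figure reported for the deployment.

```python
SECONDS_PER_DAY = 24 * 60 * 60
ASSUMED_DAYS = 60        # "two months", taken here as 60 days
SECONDS_PER_PORT = 1     # per-Port analysis time from the footnote

# Number of Ports implied by a two-month single-instance run.
implied_ports = ASSUMED_DAYS * SECONDS_PER_DAY // SECONDS_PER_PORT


def days_needed(ports: int, instances: int) -> float:
    """Days to process all Ports if the work is split evenly across instances."""
    return ports * SECONDS_PER_PORT / instances / SECONDS_PER_DAY
```

Under these assumptions the NIB holds roughly five million Ports, which matches the "millions of ports" scale mentioned earlier, and spreading the work over 60 instances would bring the run down to about a day.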


3.1 Scalability 


Onix supports three strategies that can be used to improve
scaling. First, it allows control applications to partition 
the workload so that adding instances reduces work 
without merely replicating it. Second, Onix allows for 
aggregation in which the network managed by a cluster 
of Onix nodes appears as a single node in a separate 
cluster’s NIB. This allows for federated and hierarchical 
structuring of Onix clusters, thus reducing the total 
amount of information required within a single Onix 
cluster. Finally, Onix provides applications with control 
over the consistency and durability of the network state. 
In more detail: 


• Partitioning. The network control logic may configure
Onix so that a particular controller instance keeps
only a subset of the NIB in memory and up-to-date.
Further, one Onix instance may have connections to
only a subset of the network elements and, consequently,
have fewer events originating from those elements
to process.


• Aggregation. In a multi-Onix setup, one instance of
Onix can expose a subset of the elements in its NIB 
as an aggregate element to another Onix instance. 
This is typically used to expose less fidelity to upper 
tiers in a hierarchy of Onix controllers. For example, 
in a large campus network, each building might 
be managed by an Onix controller (or controller 
cluster) which exposes all of the network elements 
in that building as a single aggregate node to a global 
Onix instance which performs campus-wide traffic 
engineering. This is similar in spirit to global control 
management paradigms in ATM networks [27]. 


• Consistency and durability. The control logic
dictates the consistency requirements for the network 
state it manages. This is done by implementing any 
of the required distributed locking and consistency 
algorithms for state requiring strong consistency, 
and providing conflict detection and resolution for 
state not guaranteed to be consistent by use of these 
algorithms. By default, Onix provides two data 
stores that an application can use for state with differ- 
ing preferences for durability and consistency. For 
state for which applications favor durability and stronger
consistency, Onix offers a replicated transactional 
database and, for volatile state that is more tolerant 
of inconsistencies, a memory-based one-hop DHT. 
We return to these data stores in Section 4. 


The above scalability mechanisms can be used to 
manage networks too large to be controlled by a single 


Onix instance. To demonstrate this, we will use a 
running example: an application that can establish paths 
between switches in a managed topology, with the goal 
of establishing complete routes through the network. 


Partition. We assume a network with a modest number 
of switches that can be easily handled by a single Onix 
instance. However, the number and size of all forwarding 
state entries on the network exceeds the memory resources 
of a single physical server. 

To handle such a scenario, the control logic can repli- 
cate all switch state, but it must partition the forwarding 
state and assign each partition to a unique Onix instance 
responsible for managing that state. The method of 
partitioning is unimportant as long as it creates relatively 
consistent chunks. 
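
One simple way to obtain such a partitioning is to hash each forwarding entry's key to an instance. The sketch below assumes string keys and a fixed list of instance names; `owner` and `partition` are hypothetical helpers, and hash-based placement is our choice for illustration, since the text does not prescribe a method.

```python
import hashlib
from typing import Dict, Iterable, List


def owner(entry_key: str, instances: List[str]) -> str:
    """Deterministically map a forwarding-entry key to one instance."""
    digest = hashlib.sha256(entry_key.encode("utf-8")).digest()
    return instances[int.from_bytes(digest[:8], "big") % len(instances)]


def partition(entry_keys: Iterable[str], instances: List[str]) -> Dict[str, List[str]]:
    """Assign every entry to exactly one instance."""
    parts: Dict[str, List[str]] = {name: [] for name in instances}
    for key in entry_keys:
        parts[owner(key, instances)].append(key)
    return parts
```

Every instance can evaluate `owner` locally, so no coordination is needed to agree on which instance manages a given entry as long as the instance list itself is shared.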

The control logic can record the switch and link 
inventory in the fully-replicated, durable state shared 
by all Onix instances, and it can coordinate updates 
using mechanisms provided by the platform. However, 
information that is more volatile, such as link utilization 
levels, can be stored in the DHT. Each controller can 
use the NIB’s representation of the complete physical 
topology (from the replicated database), coupled with 
link utilization data (from the DHT), to configure the 
forwarding state necessary to ensure paths meeting the 
deployment’s requirements throughout the network. 
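
A sketch of how a controller might combine the two sources is given below. The link list stands in for the replicated topology and the utilization dictionary for the DHT; the function name and the cost model (one unit per hop plus the link's utilization fraction) are assumptions for illustration only.

```python
import heapq
from typing import Dict, Iterable, List, Optional, Tuple

Link = Tuple[str, str]


def least_utilized_path(links: Iterable[Link], utilization: Dict[Link, float],
                        src: str, dst: str) -> Optional[List[str]]:
    """Dijkstra over the topology, weighting each link by 1 + utilization."""
    graph: Dict[str, List[Tuple[str, float]]] = {}
    for a, b in links:
        load = utilization.get((a, b), utilization.get((b, a), 0.0))
        graph.setdefault(a, []).append((b, 1.0 + load))
        graph.setdefault(b, []).append((a, 1.0 + load))

    best: Dict[str, float] = {src: 0.0}
    prev: Dict[str, str] = {}
    heap: List[Tuple[float, str]] = [(0.0, src)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dst:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, []):
            candidate = cost + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))

    if dst not in best:
        return None  # destination unreachable
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]
```

Stale utilization values only change which path is preferred, not whether the path is valid, which is why this data can tolerate the DHT's weaker consistency.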

The resulting distribution strategy closely resembles the 
use of head-end routers in MPLS [24] to manage tunnels. 
However, instead of a DHT, MPLS uses intra-domain 
routing protocols to disseminate the link utilization 
information. 


Aggregate. As our example network grows, partition- 
ing the path management no longer suffices. We assume 
that the Onix instances are still capable of holding the 
full NIB, but the control logic cannot keep up with the 
number of network events and thus saturates the CPU. 
This scenario follows from our experience in which CPU 
is commonly the limiting factor for control applications. 

To shield remote instances from high rates of updates,
the application can aggregate a topology as a single 
logical node, and use that as the unit of event dissem- 
ination between instances. For example, the topology 
can be divided into logical areas, each managed by a 
distinct Onix instance. Onix instances external to an 
area would know the exact physical topology within the 
area, but would retrieve only topologically-aggregated 
link-utilization information from the DHT (originally 
generated by instances within that area). 
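
The aggregation step can be sketched as follows. A plain dictionary stands in for the DHT, and the summary fields, the `AreaPublisher` class, and the change threshold that suppresses small updates are all assumptions made for illustration.

```python
from typing import Any, Dict, Tuple

Link = Tuple[str, str]


def aggregate_area(area: str, link_utilization: Dict[Link, float]) -> Dict[str, Any]:
    """Collapse per-link utilization into one summary for the logical node."""
    values = list(link_utilization.values())
    return {
        "node": area,
        "links": len(values),
        "max_utilization": max(values, default=0.0),
        "mean_utilization": sum(values) / len(values) if values else 0.0,
    }


class AreaPublisher:
    """Runs on the instance managing an area; publishes summaries sparingly."""

    def __init__(self, area: str, dht: Dict[str, Any], threshold: float = 0.05) -> None:
        self.area = area
        self.dht = dht              # dictionary stand-in for the DHT
        self.threshold = threshold  # assumed minimum change worth publishing
        self.links: Dict[Link, float] = {}

    def on_link_update(self, link: Link, utilization: float) -> bool:
        """Record a local update; return True if a new summary was published."""
        self.links[link] = utilization
        summary = aggregate_area(self.area, self.links)
        last = self.dht.get(self.area)
        changed = (last is None or
                   abs(summary["max_utilization"] - last["max_utilization"])
                   >= self.threshold)
        if changed:
            self.dht[self.area] = summary
        return changed
```

Remote instances read a single entry per area, so the rate of events they see depends on how often the summary changes rather than on the number of links inside the area.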

This use of topological aggregation is similar to ATM 
PNNI [27], in which the internals of network areas are 
aggregated into single logical nodes when exposed to 
neighboring routers. The difference is that the Onix 
instances and switches still have full connectivity between 
them, and it is assumed that the latency between any two
elements (between the switches and Onix instances or
between Onix instances) is not a problem.


Partition further. At some point, the number of el- 
ements within a control domain will overwhelm the 
capacity of a single Onix instance. However, due to 
relatively slow change rates of the physical network, it is 
still possible to maintain a distributed view of the network 
graph (the NIB). 

Applications can still rely on aggregating link utiliza- 
tion information, but in a partitioned NIB scheme, they 
would use the inter-Onix state distribution mechanisms to 
mediate requests to switches in remote areas; this can be 
done by using NIB attributes as a remote communication 
channel. The “request” and “response” are relayed 
between the areas using the DHT. Because this transfer 
might happen via a third Onix instance, any application 
that needs faster response times may configure DHT key 
ranges for areas and use DHT keys such that for the 
modified entity its attributes are stored within the proper 
area. 

It is possible for this approach to scale to wide-area 
deployment scenarios. For example, each partition 
could represent a large network area, and each network 
is exposed as an aggregate node to a cluster of Onix 
instances that make global routing decisions over the 
aggregate nodes. Thus, each partition makes local 
routing decisions, and the cluster makes routing decisions 
between these partitions (abstracting each as a single 
logical node). The state distribution requirements for 
this approach would be almost identical to hierarchical 
MPLS. 


Inter-domain aggregation. Once the controlled net- 
work spans two separate ASes, sharing full topology 
information among the Onix instances becomes infeasible 
due to privacy reasons and the control logic designer 
needs to adapt the design again to changed requirements. 

The platform does not dictate how the ASes would peer, 
but at a high level they would have two requirements 
to fulfill: a) sharing their topologies at some level of 
detail (while preserving privacy) with their peers, and b) 
establishing paths for each other proactively (according 
to a peering contract) or on-demand, and exchanging their 
ingress information. For both requirements, there are 
proposals in academia [13] and industry deployments [12] 
that applications could implement to arrange peering 
between Onix instances in adjacent ASes. 


3.2 Reliability 


Control applications on Onix must handle four types 
of network failures: forwarding element failures, link 
failures, Onix instance failures, and failures in connectiv- 
ity between network elements and Onix instances (and 
between the Onix instances themselves). This section 
discusses each in turn. 


Network element and link failures. Modern control 
planes already handle network element and link failures, 
and control logic built on Onix can use the same 
mechanisms. If a network element or link fails, the 
control logic has to steer traffic around the failures. The 
dissemination times of the failures through the network 
together with the re-computation of the forwarding tables 
define the minimum time for reacting to the failures. 
Given increasingly stringent requirements on convergence 
times, it may be preferable that convergence be handled 
partially by backup paths with fast failover mechanisms 
in the network element. 


Onix failures. To handle an Onix instance failure, the 
control logic has two options: running instances can 
detect a failed node and take over the responsibilities 
of the failed instance quickly, or more than one instance 
can simultaneously manage each network element. 

Onix provides coordination facilities (discussed in 
Section 4) for detecting and reacting to instance failures. 
For the simultaneous management of a network element 
by more than one Onix instance, the control logic has 
to handle lost update race conditions when writing to 
network state. To help, Onix provides hooks that appli- 
cations can use to determine whether conflicting changes 
made by other instances to the network element can be 
overridden. Provided the control logic computes the same 
network element state in a deterministic fashion at each 
Onix instance, i.e., every Onix instance implements the 
same algorithm, the state can remain inconsistent only 
transiently. At a high level, this approach is similar to 
the reliability mechanisms of RCP [3], in which multiple 
centralized controllers push updates over iBGP to edge 
routers. 
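
The convergence argument above can be sketched as follows: if every instance derives the element state with the same pure function of the topology, the order in which their writes arrive does not matter. The `Switch` class and `compute_forwarding_state` function are hypothetical stand-ins, not part of the Onix API.

```python
def compute_forwarding_state(links, switch):
    """Pure function of the topology: same input, same output everywhere.

    Sorting removes any dependence on iteration order, one source of
    nondeterminism that could let two instances disagree.
    """
    neighbors = sorted(
        dst if src == switch else src
        for src, dst in links
        if switch in (src, dst)
    )
    return {"switch": switch, "ports": neighbors}


class Switch:
    """Stand-in for a network element that keeps the last write."""

    def __init__(self):
        self.state = None

    def write(self, state):
        self.state = state


topology = [("s1", "s2"), ("s1", "s3")]
element = Switch()
# Two instances see the links in different orders and both write.
for instance_view in (topology, list(reversed(topology))):
    element.write(compute_forwarding_state(instance_view, "s1"))
```

Whichever write lands last, the element ends up with the same state, so any disagreement lasts only while the instances' views of the topology differ.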


Connectivity infrastructure failures. Onix state dis- 
tribution mechanisms decouple themselves from the 
underlying topology. Therefore, they require connectivity 
to recover from failures, both between network elements 
and Onix instances as well as between Onix instances. 
There are a number of methods for establishing this 
connectivity. We describe some of the more common 
deployment scenarios below. 

It is not unusual for an operational network to have a 
dedicated physical network or VLAN for management. 
This is common, for example, in large datacenter build- 
outs or hosting environments. In such environments, 
Onix can use the management network for control traffic, 
isolating it from forwarding plane disruptions. Under 
this deployment model, the control network uses standard 
networking gear and thus any disruption to the control 
network is handled with traditional protocols (e.g., OSPF 
or spanning tree). 

Even if the environment does not provide a separate 
control network, the physical network topology is typ- 
ically known to Onix. Therefore, it is possible for the 
control logic to populate network elements with static 
forwarding state to establish connectivity between Onix 
and the switches. To guarantee connectivity in the presence 
of failures, source routing can be combined with multi- 
pathing (also implemented below Onix): source routing 
packets over multiple paths can guarantee extremely 
reliable connectivity to the managed network elements, 
as well as between Onix instances. 


4 Distributing the NIB 


This section describes how Onix distributes its Network 
Information Base and the consistency semantics an 
application can expect from it. 


4.1 Overview 


Onix’s support for state distribution mechanisms was 
guided by two observations on network management ap- 
plications. First, applications have differing requirements 
on scalability, frequency of updates on shared space, 
and durability. For example, network policy declarations 
change slowly and have stringent durability requirements. 
Conversely, logic using link load information relies on 
rapidly-changing network state that is more transient 
in nature (and thus does not have the same durability 
requirements). 

Second, distinct applications often have different 
requirements for the consistency of the network state 
they manage. Link state information and network 
policy configurations are extreme examples: transiently- 
inconsistent status flags of adjacent links are easier for an 
application to resolve than an inconsistency in network- 
wide policy declaration. In the latter case, a human may 
be needed to perform the resolution correctly. 

Onix supports an application’s ability to choose be- 
tween update speeds and durability by providing two sep- 
arate mechanisms for distributing network state updates 
between Onix instances: one designed for high update 
rates with guaranteed availability, and one designed 
with durability and consistency in mind. Following 
the example of many distributed storage systems that 
allow applications to make performance/scalability trade- 
offs [2, 8, 29, 31], Onix makes application designers 
responsible for explicitly determining their preferred 
mechanism for any given state in the NIB — they can 
also opt to use the NIB solely as storage for local state. 
Furthermore, Onix can support arbitrary storage systems 
if applications write their own import and export modules, 
which transfer data into the NIB from storage systems 
and out of the NIB to storage systems respectively. 

In accommodating applications’ differing 
consistency requirements, Onix relies on their help: it 
expects the applications to use the provided coordination 
facilities [19] to implement distributed locking or consen- 
sus protocols as needed. The platform also expects the 
applications to provide the implementation for handling 
any inconsistencies arising between updates, if they are 
not using strict data consistency. While applications are 
given the responsibility to implement the inconsistency 
handling, Onix provides a programmatic framework to 
assist the applications in doing so. 

Thus, application designers are free to determine 
the trade-off between potentially simplified application 
architectures (promoting consistency and durability) and 
more efficient operations (with the cost of increased 
complexity). We now discuss the state distribution 
between Onix instances in more detail, as well as how 
Onix integrates network elements and their state into these 
distribution mechanisms. 


4.2 State Distribution Between Onix Instances 


Onix uses different mechanisms to keep state consistent 
between Onix instances and between Onix and the 
network forwarding elements. The reasons for this are 
twofold. First, switches generally have low-powered 
management CPUs and limited RAM. Therefore, the 
protocol should be lightweight and primarily for con- 
sistency of forwarding state. Conversely, Onix instances 
can run on high-powered general compute platforms and 
do not have such limitations. Second, the requirements 
for managing switch state are much narrower and better 
defined than those of any given application. 

Onix implements a transactional persistent database 
backed by a replicated state machine for disseminating 
all state updates requiring durability and simplified 
consistency management. The replicated database comes 
with severe performance limitations, and therefore it 
is intended to serve only as a reliable dissemination 
mechanism for slowly changing network state. The 
transactional database provides a flexible SQL-based 
querying API together with triggers and rich data models 
for applications to use directly, as necessary. 

To integrate the replicated database with the NIB, 
Onix includes import/export modules that interact with 
the database. These components load and store entity 
declarations and their attributes from/to the transactional 
database. Applications can easily group NIB modifica- 
tions together into a single transaction to be exported to 
the database. When the import module receives a trigger 
invocation from the database about changed database 
contents, it applies the changes to the NIB. 
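
A minimal sketch of grouping NIB modifications into one transaction is shown below. It uses Python's built-in `sqlite3` with an in-memory database as a stand-in for the replicated database; the table layout and the `export_transaction` helper are assumptions for illustration, and the text does not name the database Onix uses.

```python
import sqlite3

# In-memory SQLite stands in for the replicated database (an assumption).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE attrs (entity TEXT, name TEXT, value TEXT NOT NULL, "
    "PRIMARY KEY (entity, name))"
)


def export_transaction(conn, modifications):
    """Write a group of NIB modifications atomically: all or nothing."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany("INSERT INTO attrs VALUES (?, ?, ?)", modifications)


export_transaction(db, [("node-1", "name", "edge-a"),
                        ("node-1", "policy", "deny-all")])

try:
    # The second row violates NOT NULL, so the whole group is rolled back,
    # including the valid first row.
    export_transaction(db, [("node-2", "name", "edge-b"),
                            ("node-2", "policy", None)])
except sqlite3.IntegrityError:
    pass

rows = db.execute(
    "SELECT entity, name, value FROM attrs ORDER BY entity, name"
).fetchall()
```

Only the first group survives, which is the property an application relies on when it wants several related entity changes to become visible together.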

For network state needing high update rates and avail- 
ability, Onix provides a one-hop, eventually-consistent, 
memory-only DHT (similar to Dynamo [9]), relaxing 
the consistency and durability guarantees provided by 
the replicated database. In addition to the common 
get/put API, the DHT provides soft-state triggers: the 
application can register to receive a callback when a 
particular value gets updated, after which the trigger must 
be reinstalled. False positives are allowed to simplify 
the implementation of the DHT replication mechanism. 
The DHT implementation manages its membership state 
and assigns key-range responsibilities using the same 
coordination mechanisms provided to applications. 
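
The one-shot trigger behavior described above can be sketched with a toy, single-process store. The `SoftStateStore` class and its method names are hypothetical; they illustrate the reinstall-after-callback pattern and are not the Onix DHT interface.

```python
class SoftStateStore:
    """Toy single-process key-value store with one-shot triggers."""

    def __init__(self):
        self._values = {}
        self._triggers = {}  # key -> callbacks, cleared when they fire

    def get(self, key):
        return self._values.get(key)

    def register_trigger(self, key, callback):
        self._triggers.setdefault(key, []).append(callback)

    def put(self, key, value):
        self._values[key] = value
        # Pop before calling so each trigger fires at most once per install.
        for callback in self._triggers.pop(key, []):
            callback(key)


store = SoftStateStore()
seen = []


def on_update(key):
    # A callback may be a false positive, so re-read the value
    # instead of trusting the notification alone.
    seen.append(store.get(key))
    store.register_trigger(key, on_update)  # reinstall for the next update


store.register_trigger("link-7/load", on_update)
store.put("link-7/load", 0.3)
store.put("link-7/load", 0.6)

once = []
store.register_trigger("link-8/load", lambda key: once.append(store.get(key)))
store.put("link-8/load", 0.1)
store.put("link-8/load", 0.2)  # no callback: the trigger was not reinstalled
```

The second key shows why reinstalling matters: a listener that forgets to do so silently stops receiving updates.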

Updates to the DHT by multiple Onix instances can 
lead to state inconsistencies. For instance, while using 
triggers, the application must be carefully prepared for any 
race conditions that could occur due to multiple writers 
and callback delays. Also, the introduction of a second 
storage system may result in inconsistencies in the NIB. 
In such cases, the Onix DHT returns multiple values for 
a given key, and it is up to the applications to provide 
conflict resolution, or avoid these conditions by using 
distributed coordination mechanisms. 


4.3 Network Element State Management 


The Onix design does not dictate a particular protocol for 
managing network element forwarding state. Rather, the 
primary interface to the application is the NIB, and any 
suitable protocol supported by the elements in the network 
can be used under the covers to keep the NIB entities in 
sync with the actual network state. In this section we 
describe the network element state management protocols 
currently supported by Onix. 

OpenFlow [23] provides a performance-optimized 
channel to the switches for managing forwarding tables 
and quickly learning port status changes (which may have 
an impact on reachability within the network). Onix turns 
OpenFlow events and operations into state that it stores in 
the NIB entities. For instance, when an application adds 
a flow entry to a ForwardingTable entity in the NIB, 
the OpenFlow export component will translate that into 
an OpenFlow operation that adds the entry to the switch 
TCAM. Similarly, the TCAM entries are accessible to the 
application in the contents of the ForwardingTable 
entity. 
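
The translation from NIB changes to switch operations might look roughly like the sketch below. Only the `ForwardingTable` name comes from the text; its methods, the `OpenFlowExporter` class, and the tuple format are assumptions, and the exporter records operations instead of sending them. `OFPFC_ADD` is the OpenFlow flow-mod command for adding an entry.

```python
class ForwardingTable:
    """Hypothetical NIB entity; only the name comes from the text."""

    def __init__(self, switch_id):
        self.switch_id = switch_id
        self.entries = []
        self._listeners = []

    def add_listener(self, listener):
        self._listeners.append(listener)

    def add_entry(self, match, actions):
        entry = {"match": match, "actions": actions}
        self.entries.append(entry)
        # Notify export components that the entity changed.
        for listener in self._listeners:
            listener(self, entry)


class OpenFlowExporter:
    """Turns NIB changes into switch operations (recorded, not sent)."""

    def __init__(self):
        self.sent = []

    def __call__(self, table, entry):
        self.sent.append(
            (table.switch_id, "OFPFC_ADD", entry["match"], entry["actions"])
        )


exporter = OpenFlowExporter()
table = ForwardingTable("switch-1")
table.add_listener(exporter)
table.add_entry({"in_port": 1}, ["output:2"])
```

The application only touches the entity; the protocol-specific step happens in the listener.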

For managing and accessing general switch configu- 
ration and status information, an Onix instance can opt 
to connect to a switch over a configuration database pro- 
tocol (such as the one supported by Open vSwitch [26]). 
Typically this database interface exposes the switch 
internals that OpenFlow does not. For Onix, the protocol 
provides a mechanism to receive a stream of switch state 
updates, as well as to push changes to the switch state. 
The low-level semantics of the protocol closely resemble 
those of the transactional database (used between controllers) 
discussed above, but instead of requiring full SQL support 
from the switches, the database interface has a more 
restricted query language that does not provide joins. 

Similarly to the integration with OpenFlow, Onix 
provides convenient, data-oriented access to the switch 
configuration state by mapping the switch database 
contents to NIB entities that can be read and modified 
by the applications. For example, by creating and 
attaching Port entities with proper attributes to a 
ForwardingEngine entity (which corresponds to a 
single switch datapath), applications can configure new 
tunnel endpoints without knowing that this translates to 
an update transaction sent to the corresponding switch. 





4.4 Consistency and Coordination 


The NIB is the central integration point for multiple data 
sources (other Onix instances as well as connected net- 
work elements); that is, the state distribution mechanisms 
do not interact directly with each other, but rather they 
import and export state into the NIB. To support multiple 
applications with possibly very different scalability and 
reliability requirements, Onix requires the applications 
to declare what data should be imported to and exported 
from a particular source. Applications do this through the 
configuration of import and export modules. 

The NIB integrates the data sources without requiring 
strong consistency, and as a result, the state updates to 
be imported into NIB may be inconsistent either due 
to the inconsistency of state within an individual data 
source (DHT) or due to inconsistencies between data 
sources. To this end, Onix expects the applications to 
register inconsistency resolution logic with the platform. 
Applications have two means to do so. First, in Onix, 
entities are C++ classes that the application may extend, 
and thus, applications are expected simply to use in- 
heritance to embed referential inconsistency detection 
logic into entities so that applications are not exposed to 
inconsistent state.⁸ Second, the plugins the applications 
pass to the import/export components implement conflict 
resolution logic, allowing the import modules to know 
how to resolve situations where both the local NIB and 
the data source have changes for the same state. 

For example, consider a new Node N, imported into 
the NIB from the replicated database. If N contains 
a reference in its list of ports to Port P that has not 
yet been imported (because they are retrieved from the 
network elements, not from the replicated database), the 
application might prefer that N not expose a reference 
to P to the control logic until P has been imported. 
Furthermore, if the application is using the DHT to 
store statistics about the number of packets forwarded 
by N, it is possible for the import module of an 
Onix instance to retrieve two different values for this 
number from the DHT (e.g., due to rebalancing of 
controllers’ responsibilities within a cluster, resulting in 
two controllers transiently updating the same value). The 
application’s conflict resolution logic must reconcile these 
values, storing only one into the NIB and back out to the 
DHT. 

⁸ Any inconsistent changes remain pending within the NIB until they 
can be applied or applications deem them invalid for good. 
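
For the packet-counter case, application-supplied conflict resolution could be as simple as the sketch below. Plain dictionaries stand in for the DHT and the NIB, and choosing the largest value assumes the statistic is a monotonically increasing counter; both are illustrative assumptions, not Onix behavior.

```python
def resolve_counter(values):
    """Pick one value when the DHT returns several for the same key.

    Assumes a monotonically increasing counter, so the largest value is
    the most recent. Other attributes would need other rules.
    """
    if not values:
        raise ValueError("no values to reconcile")
    return max(values)


def import_counter(dht, nib, key):
    """Reconcile, then store the single result in the NIB and the DHT."""
    winner = resolve_counter(dht[key])
    nib[key] = winner
    dht[key] = [winner]
    return winner


# Plain dicts stand in for the DHT (key -> list of values) and the NIB.
dht = {"node-N/packets": [10_500, 10_742]}
nib = {}
import_counter(dht, nib, "node-N/packets")
```

A rule like this only suits values with a natural ordering; conflicting policy declarations, for example, would need a different strategy or human input.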

This leaves the application with a consistent topology 
data model. However, the application still needs to react to 
Onix instance failures and use the provided coordination 
mechanisms to determine which instances are responsible 
for different portions of the NIB. As these responsibilities 
shift within the cluster, the application must instruct the 
corresponding import and export modules to adjust their 
behaviors. 

For coordination, Onix embeds Zookeeper [19] and 
provides applications with an object-oriented API to its 
filesystem-like hierarchical namespace, convenient for 
realizing distributed algorithms for consensus, group 
membership, and failure detection. While some appli- 
cations may prefer to use Zookeeper’s services directly 
to store persistent configuration state instead of the trans- 
actional database, for most the object size limitations of 
Zookeeper and convenience of accessing the configuration 
state directly through the NIB are a reason to favor the 
transactional database. 
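
To show how a hierarchical namespace supports group membership and failure detection, the sketch below simulates ephemeral, sequentially numbered nodes in memory and elects the member with the lowest sequence number. `ToyNamespace` is a hypothetical stand-in written for this example; it is neither the Zookeeper client API nor the Onix wrapper around it.

```python
import itertools


class ToyNamespace:
    """In-memory stand-in for a namespace with ephemeral, sequential nodes."""

    def __init__(self):
        self._seq = itertools.count()
        self._nodes = {}  # path -> owning session

    def create_ephemeral_sequential(self, prefix, session):
        path = f"{prefix}{next(self._seq):010d}"
        self._nodes[path] = session
        return path

    def expire_session(self, session):
        # Ephemeral nodes disappear when their owner's session ends.
        self._nodes = {p: s for p, s in self._nodes.items() if s != session}

    def children(self, prefix):
        return sorted(p for p in self._nodes if p.startswith(prefix))

    def owner(self, path):
        return self._nodes[path]


def leader(ns, prefix):
    """The live member holding the lowest sequence number leads."""
    members = ns.children(prefix)
    return ns.owner(members[0]) if members else None


ns = ToyNamespace()
for name in ("onix-a", "onix-b", "onix-c"):
    ns.create_ephemeral_sequential("/onix/members/n-", name)

first = leader(ns, "/onix/members/n-")
ns.expire_session("onix-a")  # simulated instance failure
second = leader(ns, "/onix/members/n-")
```

Membership is simply the list of children, and a failed instance drops out of it when its session expires.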


5 Implementation 


Onix consists of roughly 150,000 lines of C++ and 
integrates a number of third-party libraries. At its simplest, 
Onix is a harness which contains logic for communicating 
with the network elements, aggregating that information 
into the NIB, and providing a framework in which appli- 
cation programmers can write a management application. 

A single Onix instance can run across multiple pro- 
cesses, each implemented using a different programming 
language, if necessary. Processes are interconnected using 
the same RPC system that Onix instances can use among 
themselves, but instead of running over TCP/IP it runs 
over local IPC connections. In this model, supporting a 
new programming language becomes a matter of writing 
a few thousand lines of integration code, typically in the 
new language itself. Onix currently supports C++, Python, 
and Java. 

Independent of the programming language, all soft- 
ware modules in Onix are written as loosely-coupled 
components, which can be replaced with others without 
recompiling Onix as long as the component’s binary 
interface remains the same. Components can be loaded 
and unloaded dynamically and designers can express 
dependencies between components to ensure they are 
loaded and unloaded in the proper order. 


6 Applications 


In this section, we discuss some applications currently 
being built on top of Onix. In keeping with the focus of 
the paper, we limit the applications discussed to those that 
are being developed for production environments. We 
believe the range of functionality they cover demonstrates 
the generality of the platform. Table 2 lists the ways in 
which these applications stress the various Onix features. 


Ethane. For enterprise networks, we have built a 
network management application similar to Ethane [5] to 
enforce network security policies. Using the Flow-based 
Management Language (FML) [18] network administra- 
tors can declare security policies in a centralized fashion 
using high-level names instead of network-level addresses 
and identifiers. The application processes the first packet 
of every flow obtained from the first hop switch: it tracks 
hosts’ current locations, applies the security policies, and 
if the flow is approved, sets up the forwarding state for 
the flow through the network to the destination host. The 
link state of the network is discovered through LLDP 
messages sent by Onix instances as each switch connects. 

Since the aggregate flow traffic of a large network can 
easily exceed the capacity of a single server, a large-scale 
deployment of our implementation requires multiple 
Onix instances to partition the flow processing. Further, 
having Onix on the flow-setup path makes failover 
between multiple instances particularly important. 

Partitioning the flow-processing state requires that all 
controllers be able to set up paths in the network, end to 
end. Therefore, each Onix instance needs to know the 
location of all end-points as well as the link state of the 
network. However, it is not particularly important that this 
information be strongly consistent between controllers. 
At worst, a flow is routed to an old location of the host 
over a failed link, which is impossible to avoid during 
network element failures. It is also unnecessary for 
the link state to be persistent, since this information is 
obtained dynamically. Therefore, the controllers can 
use the DHT for storing link-state, which allows tens 
of thousands of updates per second (see Section 7). 


Distributed Virtual Switch (DVS). In virtualized en- 
terprise network environments, the network edge consists 
of virtual, software-based L2 switch appliances within 
hypervisors instead of physical network switches [26]. 
It is not uncommon for virtual deployments (especially 
in cloud-hosting providers) to consist of tens of VMs 
per server, and to have hundreds, thousands or tens of 
thousands of VMs in total. These environments can also 
be highly dynamic, such that VMs are added, deleted and 
migrated on the fly. 

To cope with such environments, the concept of a 
distributed virtual switch (DVS) has arisen [33]. A DVS 
roughly operates as follows. It provides a logical switch 
abstraction over which policies (e.g., policing, QoS, 
ACLs) are declared over the logical switch ports. These 
ports are bound to virtual machines through integration 
with the hypervisor. As the machines come and go and 
move around the network, the DVS ensures that the 
policies follow the VMs and therefore do not have to 
be reconfigured manually; to this end, the DVS integrates 
with the host virtualization platform. 

Table 2: Aspects of Onix especially stressed by deployed control logic applications. 

Thus, when operating as part of a DVS application, 
Onix is not involved in forwarding plane flow setup, 
but only invoked when VMs are created, destroyed, or 
migrated. Hypervisors are organized as pools consisting 
of a reasonably small number of hypervisors, and VMs 
typically do not migrate across pools; therefore, 
the control logic can easily partition itself according to 
these pools. A single Onix instance then handles all the 
hypervisors of a single pool. All the switch configuration 
state is persisted to the transactional database, whereas 
VM locations are not shared between Onix instances. 

If an Onix instance goes down, the network can 
still operate. However, VM dynamics will no longer 
be allowed. Therefore, high availability in such an 
environment is less critical than in the Ethane environment 
described previously, in which an Onix crash would 
render the network inoperable to new flows. In our DVS 
application, for simplicity, reliability is achieved 
through a cold standby prepared to boot in a failure 
condition. 


Multi-tenant virtualized data centers. Multi-tenant 
environments exacerbate the problems described in the 
context of the previous application. The problem state- 
ment is similar, however: in addition to handling end-host 
dynamics, the network must also enforce both addressing 
and resource isolation between tenant networks. Tenant 
networks may have, for example, overlapping MAC 
or IP addresses, and may run over the same physical 
infrastructure. 

We have developed an application on top of Onix which 
allows the creation of tenant-specific L2 networks. These 
networks provide a standard Ethernet service model and 
can be configured independently of each other and can 
span physical network subnets. 

The control logic isolates tenant networks by encap- 
sulating tenants’ packets at the edge, before they enter 
the physical network, and decapsulating them when they 
either enter another hypervisor or are released to the 
Internet. For each tenant virtual network, the control logic 
establishes tunnels pair-wise between all the hypervisors 
running VMs attached to the tenant virtual network. As 
a result, the number of required tunnels is O(N^2), and 
thus, with potentially tens of thousands of VMs per tenant 
network, the state for just tunnels may grow beyond the 
capacity of a single Onix instance, not to mention that the 
switch connections can be equally numerous.⁹ 
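
The quadratic growth is easy to check numerically. The sketch below counts one tunnel per unordered pair of hypervisors, N(N-1)/2; counting one tunnel per direction would double the figures. The function name and sample sizes are chosen for illustration.

```python
def full_mesh_tunnels(hypervisors: int) -> int:
    """Number of pairwise tunnels among N hypervisors: N(N-1)/2."""
    if hypervisors < 0:
        raise ValueError("hypervisor count must be non-negative")
    return hypervisors * (hypervisors - 1) // 2


counts = {n: full_mesh_tunnels(n) for n in (10, 1_000, 10_000)}
```

At ten thousand hypervisors the mesh already needs close to fifty million tunnels, which motivates partitioning the tenant network across instances.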

Therefore, the control logic partitions the tenant net- 
work so that multiple Onix instances share responsibility 
for the network. A single Onix instance manages only a 
subset of hypervisors, but publishes the tunnel end-point 
information over the DHT so any other instances needing 
to set up a tunnel involving one of those hypervisors can 
configure the DHT import module to load the relevant 
information into the NIB. The tunnels themselves are 
stateless, and thus, multiple hypervisors can send traffic 
to a single receiving tunnel end-point. 


Scale-out carrier-grade IP router. We are currently 
considering a design to create a scale-out BGP router us- 
ing commodity switching components as the forwarding 
plane. This project is still in the design phase, but we 
include it here to demonstrate how Onix can be used with 
traditional network control logic. 

In our design, Onix provides the “glue” between the 
physical hardware (a collection of commodity switches) 
and the control plane (an open source BGP stack). Onix 
is therefore responsible for aggregating the disparate 
hardware devices and presenting them to the control logic 
as a single forwarding plane, consisting of an L2/L3 table, 
and a set of ports. Onix is also responsible for translating 
the RIB, as calculated by the BGP stack, into flow entries 
across the cluster of commodity switches. 

In essence, Onix will provide the logic to build a scale- 
out chassis from the switches. The backplane of the 
chassis is realized through the use of multiple connections 
and multi-pathing between the switches, and individual 
switches act as line-cards. If a single switch fails, Onix 
will alert the routing stack that the associated ports on the 
full chassis have gone offline. However, this should not 
affect the other switches within the cluster. 

The control traffic from the network (e.g., BGP or 
IGP traffic) is forwarded from the switches to Onix, 
which annotates it with the correct logical switch port and 
forwards it to the routing stack. Because only a handful of 
switches are used, the memory and processing demands 
of this application are relatively modest. A single Onix 
instance with an active failover (on which the hardware 
configuration state is persistent) is sufficient for even very 
large deployments. This application is discussed in more 
detail in [7]. 

⁹ The VMs of a single tenant are not likely to share physical servers, 
to avoid fate-sharing in hardware failure conditions. 

Figure 3: Attribute modification throughput as the number of 
listeners attached to the NIB increases. 

Figure 4: Memory usage as the number of NIB entities 
increases. 


7 Evaluation 


In this section, we evaluate Onix in two ways: with 
micro-benchmarks, designed to test Onix’s performance 
as a general platform, and with end-to-end performance 
measurements of an in-development Onix application in 
a test environment. 


7.1 Scalability Micro-Benchmarks 


Single-node performance. We first benchmark three 
key scalability-related aspects of a single Onix instance: 
throughput of the NIB, memory usage of the NIB, and 
bandwidth in the presence of many connections. 

The NIB is the focal point of the API, and the 
performance of an application will depend on the capacity 
the NIB has for processing updates and notifying listeners. 
To measure this throughput, we ran a micro-benchmark 
where an application repeatedly acquired exclusive access 
to the NIB (by its cooperative thread acquiring the CPU), 
modified integer attributes of an entity (which triggers 
immediate notification of any registered listeners), and 
then released NIB access. In this test, none of the listeners 
acted on the notifications of NIB changes they received. 
Figure 3 contains the results. With only a single attribute 
modification, this micro-benchmark essentially becomes 
a benchmark for our threading library, as acquiring 
exclusive access to the NIB translates to a context switch. 
As the number of modified attributes between context 
switches increases, the effective throughput increases 
because the modifications involve only a short, fine-tuned 
code path through the NIB to the listeners. 

Figure 5: Number of 64-byte packets forwarded per second by 
a single Onix node, as the number of switch connections increases. 

Onix NIB entities provide convenient state access 
for the application as well as for import and export 
modules. The NIB must thus be able to handle a 
large number of entries without excessive memory usage. 
Figure 4 displays the results of measuring the total 
memory consumption of the C++ process holding the 
NIB while varying both network topology size and the 
number of attributes per entity. Each attribute in this 
test is 16 bytes (on average), with an 8-byte attribute 
identifier (plus C++ string overhead); in addition, Onix 
uses a map to store attributes (for indexing purposes) that 
reserves memory in discrete chunks. A zero-attribute 
entity, including the overhead of storing and indexing 
it in the NIB, consumes 191 bytes. The results in 
Figure 4 suggest a single Onix instance (on a server- 
grade machine) can easily handle networks of millions 
of entities. As entities include more attributes, their sizes 
increase proportionally. 
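
The figures quoted above allow a back-of-the-envelope lower bound on NIB memory. The sketch below uses only the three sizes stated in the text and deliberately ignores the C++ string overhead and the chunked map allocation, so real usage will be higher; the function name is hypothetical.

```python
ZERO_ATTRIBUTE_ENTITY_BYTES = 191  # stated in the text
ATTRIBUTE_VALUE_BYTES = 16         # average, stated in the text
ATTRIBUTE_ID_BYTES = 8             # stated in the text


def nib_memory_lower_bound(entities: int, attrs_per_entity: int) -> int:
    """Bytes needed, ignoring string and map-allocation overhead."""
    per_entity = ZERO_ATTRIBUTE_ENTITY_BYTES + attrs_per_entity * (
        ATTRIBUTE_VALUE_BYTES + ATTRIBUTE_ID_BYTES
    )
    return entities * per_entity


one_million_bare = nib_memory_lower_bound(1_000_000, 0)
one_million_ten_attrs = nib_memory_lower_bound(1_000_000, 10)
```

Under these assumptions a million bare entities need at least about 191 MB, and ten attributes each raise the bound to about 431 MB.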

Each Onix instance has to connect to the switches 
it manages. To stress this interface, we connected 
a (software) switch cloud to a single Onix instance and 
ran an application that, after receiving a 64-byte packet 
from a random switch, made a forwarding decision 
without updating the switch’s forwarding tables. That 
is, the application sent the packet back to the switch with 
forwarding directions for that packet alone. Because of 
the application’s simplicity, the test effectively bench- 
marks the performance of our OpenFlow stack, which 
has the same code path for both packets and network 
events (such as port events). Figure 5 shows the stack 
can perform well (forwarding over one hundred thousand 
packets per second), with up to roughly one thousand 
concurrent connections. We have not yet optimized our 
implementation in this regard, and the results highlight a 
known limitation of our threading library, which forces 
the OpenFlow protocol stack to do more threading context 
switches as the number of connections increases. Bumps 
in the graph are due to the operating system scheduling 
the controller process over multiple CPU cores. 

Figure 6: RPC calls per second processed by a single Onix 
node, as the size of the RPC request-response pair increases. 

Figure 7: A CDF showing the latency of updating a DHT value 
at one node, and for that update to be fetched by another node 
in a 5-node network. 


Multi-node performance. Onix instances use three 
mechanisms to cooperate: two state update dissemination 
mechanisms (the DHT and the replicated, transactional 
database) and the Zookeeper coordination mechanism. 
Zookeeper’s performance has been studied elsewhere [19], 
so we focus on the DHT and replicated database. 


The throughput of our memory-based DHT is effec- 
tively limited by the Onix RPC stack. Figure 6 shows 
the call throughput between an Onix instance acting as 
an RPC client, and another acting as an RPC server, with 
the client pipelining requests to compensate for network 
latency. The DHT performance can then be seen as the 
RPC performance divided by the replication factor. While 
a single value update may result in both a notification 
call and subsequent get calls from each Onix instance 
having an interest in the value, the high RPC throughput 
still shows our DHT to be capable of handling very 
dynamic network state. For example, if you assume 
that an application fully replicates the NIB to 5 Onix 
instances, then each NIB update will result in 22 RPC 
request-response pairs (2 to store two copies of the data 
in the DHT, 2×5 to notify all instances of the update, and 
2×5 for all instances to fetch the new value from both 
replicas and reinstall their triggers). Given the results in 
Figure 6, this implies that the application, in aggregate, 
can handle 24,000 small DHT value updates per second. 
In a real deployment this might translate, for example, 
to updating a load attribute on 24,000 link entities every 
second — a fairly ambitious scale for any physical network 
that is controlled by just five Onix instances. Applications 
can use aggregation and NIB partitioning to scale further. 
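
The RPC accounting above can be checked with a few lines of arithmetic. The sketch below assumes the reading in which the value is stored on two DHT replicas and each of the five instances is notified by, and fetches from, both replicas (2 + 2×5 + 2×5 = 22). The function and parameter names are ours for illustration and are not part of Onix; the aggregate RPC rate it prints is merely what 24,000 updates per second would imply under this model.

```python
def rpc_pairs_per_update(instances: int, replicas: int = 2) -> int:
    """RPC request-response pairs caused by one NIB update (illustrative model)."""
    store = replicas               # write the value to each DHT replica
    notify = replicas * instances  # each replica notifies every instance
    fetch = replicas * instances   # each instance re-fetches from each replica
    return store + notify + fetch


def updates_per_second(aggregate_rpc_rate: float, instances: int,
                       replicas: int = 2) -> float:
    """Update rate sustainable at a given aggregate RPC rate (pairs per second)."""
    return aggregate_rpc_rate / rpc_pairs_per_update(instances, replicas)


pairs = rpc_pairs_per_update(instances=5)
print(pairs)           # 22
print(24_000 * pairs)  # 528000 RPC pairs/s implied by 24,000 updates/s
```

Because the per-update cost grows linearly with the number of instances, full replication of the NIB becomes proportionally more expensive as the cluster grows, which is why aggregation and partitioning matter.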

Table 3: The throughput of Onix’s replicated database. 

Our replicated transactional database is not optimized 
for throughput. However, its performance has not yet 
become a bottleneck due to the relatively static nature 
of the data it stores. Table 3 shows the throughput 
for different query batching sizes (1/3 of queries are 
INSERTs, and 2/3 are SELECTs) in a 5-node replicated 
database. If the application stores its port inventory in the 
replicated database, for example, without any batching it 
can process 17 port additions and removals per second, 
along with about 6.5 queries per second from each node 
about the existence of ports (17 + 6.5 * 5 ~ 49.7). 


7.2 Reliability Micro-Benchmarks 


A primary consideration for production deployments is 
reliability in the face of failures. We now consider 
the three failure modes a control application needs to 
handle: link failures, switch failures, and Onix instance 
failures. Finally, we consider the perceived network 
communication failure time with an Onix application. 


Link and switch failures. Onix instances monitor their 
connections to switches using aggressive keepalives. 
Similarly, switches monitor their links (and tunnels) using 
hardware-based probing (such as 802.1ag CFM [1]). 
Both of these can be fine-tuned to meet application 
requirements. 

Once a link or switch failure is reported to the control 
application, the latencies involved in disseminating the 
failure-related state updates throughout the Onix cluster 
become essential; they define the absolute minimum time 
the control application will take to react to the failure 
throughout the network. 

Figure 7 shows the latencies of DHT value propagation 
in a 5-node, LAN-connected network. However, once 
the controllers are more distant from each other in 
the network, the DHT’s pull-based approach begins to 
introduce additional latencies compared to the ideal push- 
based methods common in distributed network protocols 
today. Also, the new value being put to the DHT may 
be placed on an Onix instance not on the physical path 
between the instance updating the value and the one 
interested in the new value. Thus, in the worst case, a 
state update may take four times as long as it takes to push 
the value (one hop to put the new value, one to notify an 
interested Onix instance, and two to get the new value). 
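
The worst-case factor of four follows from counting one-way hops. The sketch below is a simplified model that assumes every hop costs the same one-way latency and ignores processing time; the 10 ms figure is a hypothetical input, not a measurement from the paper.

```python
def push_latency(one_way_ms: float) -> float:
    """Ideal push: the updater sends the value straight to the interested node."""
    return one_way_ms


def pull_latency(one_way_ms: float) -> float:
    """Pull-based DHT update, worst case, with equal-cost hops."""
    put = one_way_ms      # updater -> DHT node that stores the value
    notify = one_way_ms   # DHT node -> interested Onix instance
    get = 2 * one_way_ms  # interested instance -> DHT node and back
    return put + notify + get


one_way = 10.0  # hypothetical one-way latency in milliseconds
print(pull_latency(one_way) / push_latency(one_way))  # 4.0
```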

In practice, however, this overhead tends not to 
impact network performance, because practical avail- 
ability requirements for production traffic require the 
control application to prepare for switch and link failures 
proactively by using backup paths. 


Onix instance failures. The application has to detect 
failed Onix instances and then reconfigure responsibilities 
within the Onix cluster. For this, applications rely on the 
Zookeeper coordination facilities provided by Onix. As 
with its throughput, we refer the reader to a previous 
study [19] for details. 

[Figure 8 plot: x-axis, perceived failure latency (ms); y-axis, probability.] 

Figure 8: A CDF of the perceived communication disruption 
time between two hosts when an intermediate switch fails. These 
measurements include the one-second (application-configurable) 


Application test. Onix is currently being used by a 
number of organizations as the platform for building 
commercial applications. While scaling work and testing 
is ongoing, applications have managed networks of up to 
64 switches with a single Onix instance, and Onix has 
been tested in clusters of up to 5 instances. 

We now measure the end-to-end failure reaction time of 
the multi-tenant virtualized data center application (Sec- 
tion 6). The core of the application is a set of tunnels 
creating an L2 overlay. If a switch hosting a tunnel fails, 
the application must patch up the network quickly to 
ensure continued connectivity within the overlay. 

Figure 8 shows how quickly the application can create 
new tunnels to reestablish the connectivity between hosts 
when a switch hosting a tunnel fails. The measured time 
includes the time for Onix to detect the switch failure, 
and for the application to decide on a new switch to 
host the tunnel, create the new tunnel endpoints, and 
update the switch forwarding tables. The figure shows the 
median disruption for the host-to-host communication is 
1120 ms. Given the configured one-second switch failure 
detection time, this suggests it takes Onix 120 ms to repair 
the tunnel once the failure has been detected. Although 
this application is unoptimized, we believe these results 
hold promise that a complete application on Onix can 
achieve reactive properties on par with traditional routing 
implementations. 


8 Related Work 


As mentioned in Section 1, Onix descends from a long 
line of work in which the control plane is separated 
from the dataplane [3-6, 15, 16, 23], but Onix’s focus on 
being a production-quality control platform for large-scale 
networks led us to focus more on reliability, scalability, 
and generality than previous systems. Ours is not the 
first system to consider network control as a distributed 
systems problem [10, 20], although we do not anticipate 
the need to run our platform on end-hosts, due to 
the flexibility of merchant silicon and other efforts to 
generalize the forwarding plane [23], and the rapid 
increase in power of commodity servers. 


An orthogonal line of research focuses on offering 
network developers an extensible forwarding plane (e.g., 
RouteBricks [11], Click [22] and XORP [17]); Onix is 
complementary to these systems in offering an extensible 
control plane. Similarly, Onix can be the platform 
for flexible data center network architectures such as 
SEATTLE [21], VL2 [14] and Portland [25] to manage 
large data centers. This was explored somewhat in [30]. 


Other recent work [34] reduces the load of a centralized 
controller by distributing network state amongst switches. 
Onix focuses on the problem of providing generic dis- 
tributed state management APIs at the controller, instead 
of focusing on a particular approach to scale. We view 
this work as distinct but compatible, as this technique 
could be implemented within Onix. 


Onix also follows the path of many earlier distributed 
systems that rely on applications’ help to relax consis- 
tency requirements in order to improve the efficiency of 
state replication. Bayou [31], PRACTI [2], WheelFS [29] 
and PNUTS [8] are examples of such systems. 


9 Conclusion 


The SDN paradigm uses the control platform to simplify 
network control implementations. Rather than forcing 
developers to deal directly with the details of the physical 
infrastructure, the control platform handles the lower- 
level issues and allows developers to program their 
control logic on a high-level API. In so doing, Onix 
essentially turns networking problems into distributed 
systems problems, resolvable by concepts and paradigms 
familiar to distributed systems developers. 


However, this paper is not about the ideology of SDN, 
but about its implementation. The crucial enabler of 
this approach is the control platform, and in this paper 
we present Onix as an existence proof that such control 
platforms are feasible. In fact, Onix required no novel 
mechanisms, but instead involves only the judicious use 
of standard distributed system design practices. 


What we should make clear, however, is that Onix 
does not, by itself, solve all the problems of network 
management. The designers of management applications 
still have to understand the scalability implications of 
their design. Onix provides general tools for managing 
state, but it does not magically make problems of scale 
and consistency disappear. We are still learning how to 
build control logic on the Onix API, but in the examples 
we have encountered so far management applications are 
far easier to build with Onix than without it. 
Abstract 


A persistent problem in computer network research is 
validation. When deciding how to evaluate a new feature 
or bug fix, a researcher or operator must trade off realism 
(in terms of scale, actual user traffic, real equipment) 
and cost (larger scale costs more money, real user traf- 
fic likely requires downtime, and real equipment requires 
vendor adoption which can take years). Building a realis- 
tic testbed is hard because “real” networking takes place 
on closed, commercial switches and routers with spe- 
cial purpose hardware. But if we build our testbed from 
software switches, they run several orders of magnitude 
slower. Even if we build a realistic network testbed, it 
is hard to scale, because it is special purpose and is in 
addition to the regular network. It needs its own loca- 
tion, support and dedicated links. For a testbed to have 
global reach takes investment beyond the reach of most 
researchers. 

In this paper, we describe a way to build a testbed 
that is embedded in—and thus grows with—the net- 
work. The technique—embodied in our first prototype, 
FlowVisor—slices the network hardware by placing a 
layer between the control plane and the data plane. We 
demonstrate that FlowVisor slices our own production 
network, with legacy protocols running in their own 
protected slice, alongside experiments created by re- 
searchers. The basic idea is that if unmodified hardware 
supports some basic primitives (in our prototype, Open- 
Flow, but others are possible), then a worldwide testbed 
can ride on the coat-tails of deployments, at no extra ex- 
pense. Further, we evaluate the performance impact and 
describe how FlowVisor is deployed at seven other cam- 
puses as part of a wider evaluation platform. 


1 Introduction 


For many years the networking research community has 
grappled with how best to evaluate new research ideas. 

Figure 1: Today’s evaluation process is a continuum 
from controlled but synthetic to uncontrolled but realistic 
testing, with no clear path to vendor adoption. 


Simulation [17, 19] and emulation [25] provide tightly 
controlled environments to run repeatable experiments, 
but lack scale and realism; they neither extend all the 
way to the end-user nor carry real user traffic. Special 
isolated testbeds [10, 22, 3] allow testing at scale, and 
can carry real user traffic, but are usually dedicated to a 
particular type of experiment and are beyond the budget 
of most researchers. 

Without the means to realistically test a new idea there 
has been relatively little technology transfer from the re- 
search lab to real-world networks. Network vendors are 
understandably reluctant to incorporate new features be- 
fore they have been thoroughly tested at scale, in realistic 
conditions with real user traffic. This slows the pace of 
innovation, and many good ideas never see the light of 
day. 

Peeking over the wall to the distributed systems com- 
munity, things are much better. PlanetLab has proved in- 
valuable as a way to test new distributed applications at 
scale (over 1,000 nodes worldwide), realistically (it runs 
real services, and real users opt in), and offers a straight- 
forward path to real deployment (services developed in a 
PlanetLab slice are easily ported to dedicated servers). 

In the past few years, the networking research community 
has sought an equivalent platform, funded by programs 
such as GENI [8], FIRE [6], etc. The goal is to 
allow new network algorithms, features, protocols or ser- 
vices to be deployed at scale, with real user traffic, on a 
real topology, at line-rate, with real users; and in a man- 
ner that the prototype service can easily be transferred 
to run in a production network. Examples of experimen- 
tal new services might include a new routing protocol, 
a network load-balancer, novel methods for data center 
routing, access control, novel hand-off schemes for mo- 
bile users or mobile virtual machines, network energy 
managers, and so on. 

The network testbeds that come closest to achieving 
this today are VINI [1] and Emulab [25]: both provide a 
shared physical infrastructure allowing multiple simulta- 
neous experiments to evaluate new services on a physi- 
cal testbed. Users may develop code to modify both the 
data plane and the control plane within their own isolated 
topology. Experiments may run real routing software, 
and expose their experiments to real network events. Em- 
ulab is concentrated in one location, whereas VINI is 
spread out across a wide area network. 

VINI and Emulab trade off realism for flexibility in 
three main ways. 


Speed: In both testbeds packet processing and forwarding 
is done in software by a conventional CPU. This makes 
it easy to program a new service, but means it runs much 
slower than in a real network. Real networks in enter- 
prises, data centers, college campuses and backbones are 
built from switches and routers based on ASICs. ASICs 
consistently outperform CPU-based devices in terms of 
data-rate, cost and power; for example, a single switch- 
ing chip today can process over 600Gb/s [2]. 


Scale: Because VINI and Emulab don’t run new network- 
ing protocols on real hardware, they must always exist as 
a parallel testbed, which limits their scale. It would, for 
example, be prohibitively expensive to build a VINI or 
Emulab testbed to evaluate data-center-scale experiments 
requiring thousands or tens of thousands of switches, 
each with a capacity of hundreds of gigabits per second. 
VINI’s geographic scope is limited by the locations will- 
ing to host special servers (42 today). Without enormous 
investment, it is unlikely to grow to global scale. Emu- 
lab can grow larger, as it is housed under one roof, but 
is still unlikely to grow to a size representative of a large 
network. 


Technology transfer: An experiment running on a net- 
work of CPUs takes considerable effort to transfer to 
specialized hardware; the development styles are quite 
different, and the development cycle of hardware takes 
many years and requires many millions of dollars. 

But perhaps the biggest limitation of a dedicated 
testbed is that it requires special infrastructure: equipment 
has to be developed, deployed, maintained and supported; 
and when the equipment is obsolete it needs to be 
updated. Networking testbeds rarely last more than one 
generation of technology, and so the immense engineer- 
ing effort is quickly lost. 

Our goal is to solve this problem. We set out to answer 
the following question: can we build a testbed that is 
embedded into every switch and router of the production 
network (in college campuses, data centers, WANs, en- 
terprises, WiFi networks, and so on), so that the testbed 
would automatically scale with the global network, rid- 
ing on its coat-tails with no additional hardware? If 
this were possible, then our college campus networks— 
for example—interconnected as they are by worldwide 
backbones, could be used simultaneously for production 
traffic and new WAN routing experiments; similarly, an 
existing data center with thousands of switches can be 
used to try out new routing schemes. Many of the goals 
of programs like GENI and FIRE could be met without 
needing dedicated network infrastructure. 

In this paper, we introduce FlowVisor which aims to 
turn the production network itself into a testbed (Fig- 
ure 1). That is, FlowVisor allows experimenters to eval- 
uate ideas directly in the production network (not run- 
ning in a dedicated testbed alongside it) by “slicing” the 
hardware already installed. Experimenters try out their 
ideas in an isolated slice, without the need for dedicated 
servers or specialized hardware. 


1.1 Contributions 


We believe our work makes five main contributions: 


Runs on deployed hardware and at real line-rates. 
FlowVisor introduces a software slicing layer between 
the forwarding and control planes on network devices. 
While FlowVisor could slice any control plane message 
format, in practice we implement the slicing layer with 
OpenFlow [16]. To our knowledge, no previously pro- 
posed slicing mechanism allows a user-defined control 
plane to control the forwarding in deployed production 
hardware. Note that this would not be possible with 
VLANs—while they crudely separate classes of traffic, 
they provide no means to control the forwarding plane. 
We describe the slicing layer in §2 and FlowVisor’s 
architecture in §3. 


Allows real users to opt-in on a per-flow basis. 
FlowVisor has a policy language that maps flows 
to slices. By modifying this mapping, users can easily 
try new services, and experimenters can entice users to 
bring real traffic. We describe the rules for mapping 
flows to slices in §3.2. 


Ports easily to non-sliced networks. FlowVisor (and its 
slicing) is transparent to both data and control planes, 
and therefore, the control logic is unaware of the slicing 
layer. This property provides a direct path for vendor 
adoption. In our OpenFlow-based implementation, neither 
the OpenFlow switches nor the controllers need be 
modified to interoperate with FlowVisor (§3.3). 


Enforces strong isolation between slices. FlowVisor 
blocks and rewrites control messages as they cross the 
slicing layer. Actions of one slice are prevented from 
affecting another, allowing experiments to safely coexist 
with real production traffic. We describe the details 
of the isolation mechanisms in §4 and evaluate their 
effectiveness in §5. 


Operates on deployed networks. FlowVisor has been 
deployed in our production campus network for the last 7 
months. Our deployment consists of 20+ users, 40+ net- 
work devices, a production traffic slice, and four stand- 
ing experimental slices. In §6, we describe our cur- 
rent deployment and future plans to expand into seven 
other campus networks and two research backbones in 
the coming year. 


2 Slicing Control & Data Planes 


On today’s commercial switches and routers, the control 
and data planes are usually logically distinct 
but physically co-located. The control plane creates and 
populates the data plane with forwarding rules, which 
the data plane enforces. In a nutshell, FlowVisor as- 
sumes that the control plane can be separated from the 
data plane, and it then slices the communication between 
them. This slicing approach can work several ways: for 
example, there might already be a clean interface be- 
tween the control and data planes inside the switch. More 
likely, they are separated by a common protocol (e.g., 
OpenFlow [16] or ForCES [7]). In either case, FlowVisor 
sits between the control and data planes, and from this 
vantage point enables a single data plane to be controlled 
by multiple control planes—each belonging to a separate 
experiment. 

With FlowVisor, each experiment runs in its own 
slice of the network. A researcher, Bob, begins by re- 
questing a network slice from Alice, his network admin- 
istrator. The request specifies his requirements including 
topology, bandwidth, and the set of traffic—defined by a 
set of flows, or flowspace—that the slice controls. Within 
his slice, Bob has his own control plane where he puts the 
control logic that defines how packets are forwarded and 
rewritten in his experiment. For example, imagine that 
Bob wants to create a new http load-balancer to spread 
port 80 traffic over multiple web servers. He requests 
a slice: its topology should encompass the web servers, 
and its flowspace should include all flows with port 80. 
He is allocated a control plane where he adds his load-balancing 
logic to control how flows are routed in the 
data plane. He may advertise his new service so as to attract 
users. Interested users “opt-in” by contacting their 
network administrator to add a subset of their flows to 
the flowspace of Bob’s slice. 

Figure 2: Classical network device architectures have 
distinct forwarding and control logic elements (left). By 
adding a transparent slicing layer between the forwarding 
and control elements, FlowVisor allows multiple 
control logics to manage the same forwarding element 
(middle). In implementation, FlowVisor uses OpenFlow 
and sits between an OpenFlow switch—the forwarding 
element—and multiple OpenFlow controllers—the control 
logic (right). 

In this example, FlowVisor allocates a control plane 
for Bob, and allows him to control his flows (but no oth- 
ers) in the data plane. Any events associated with his 
flows (e.g. when a new flow starts) are sent to his control 
plane. FlowVisor enforces his slice’s topology by only 
allowing him to control switches within his slice. 

FlowVisor slices the network along multiple dimen- 
sions, including topology, bandwidth, and forwarding 
table entries. Slices are isolated from each other, so 
that actions in one slice—be they faulty, malicious, or 
otherwise—do not impact other slices. 


2.1 Slicing OpenFlow 


While architecturally FlowVisor can slice any data 
plane/control plane communication channel, we built our 
prototype on top of OpenFlow. 

OpenFlow [16, 18] is an open standard that allows re- 
searchers to directly control the way packets are routed 
in the network. As described above, in a classical net- 
work architecture, the control logic and the data path are 
co-located on the same device and communicate via an 
internal proprietary protocol and bus. In OpenFlow, the 
control logic is moved to an external controller (typically 
a commodity PC); the controller talks to the datapath 
(over the network itself) using the OpenFlow protocol 
(Figure 2, right). The OpenFlow protocol abstracts 
forwarding/routing directives as “flow entries”. A flow 
entry consists of a bit pattern, a list of actions, and a set 
of counters. Each flow entry states “perform this list of 
actions on all packets in this flow” where a typical action 
is “forward the packet out port X” and the flow is defined 
as the set of packets that match the given bit pattern. The 
collection of flow entries on a network device is called 
the “flow table”. 

Figure 3: FlowVisor allows users (Doug) to delegate 
control of subsets of their traffic to distinct researchers 
(Alice, Bob, Cathy). Each research experiment runs in 
its own, isolated network slice. 


When a packet arrives at a switch or router, the device 
looks up the packet in the flow table and performs the 
corresponding set of actions. If the packet doesn’t match 
any entry, the packet is queued and a new flow event is 
sent across the network to the OpenFlow controller. The 
controller responds by adding a new rule to the flow table 
to handle the queued packet. Subsequent packets in the 
same flow will be handled without contacting the con- 
troller. Thus, the external controller need only be con- 
tacted for the first packet in a flow; subsequent packets 
are forwarded at the switch’s full line rate. 
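
The lookup-then-miss behaviour described above can be sketched in a few lines. This is a minimal illustration, not the OpenFlow wire protocol: the bit pattern is modelled as a dictionary of header fields in which `None` is a wildcard, the miss is handled synchronously rather than by queuing the packet, and the names `FlowEntry`, `Switch`, and `simple_controller` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FlowEntry:
    match: dict       # header field -> required value (None acts as a wildcard)
    actions: list     # e.g. ["output:2"]
    packets: int = 0  # per-entry counter

    def matches(self, packet: dict) -> bool:
        return all(v is None or packet.get(k) == v
                   for k, v in self.match.items())


class Switch:
    def __init__(self, controller):
        self.flow_table = []
        self.controller = controller  # callable: packet -> FlowEntry
        self.controller_calls = 0

    def receive(self, packet: dict) -> list:
        for entry in self.flow_table:
            if entry.matches(packet):
                entry.packets += 1
                return entry.actions
        # Table miss: ask the controller for a rule, install it, then apply it.
        self.controller_calls += 1
        entry = self.controller(packet)
        self.flow_table.append(entry)
        entry.packets += 1
        return entry.actions


def simple_controller(packet: dict) -> FlowEntry:
    # Hypothetical policy: forward each (destination, port) flow out port 2.
    return FlowEntry(match={"ip_dst": packet["ip_dst"],
                            "tcp_dst": packet["tcp_dst"]},
                     actions=["output:2"])


sw = Switch(simple_controller)
pkt = {"ip_src": "10.0.0.1", "ip_dst": "10.0.0.9", "tcp_dst": 80}
first = sw.receive(pkt)   # miss: the controller is consulted
second = sw.receive(pkt)  # hit: handled entirely by the flow table
print(sw.controller_calls, sw.flow_table[0].packets)
```

Only the first packet of the flow reaches the controller; the second is served from the installed entry, which is the property that lets subsequent packets be forwarded at line rate.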


Architecturally, OpenFlow exploits the fact that mod- 
ern switches and routers already logically implement 
flow entries and flow tables—typically in hardware as 
TCAMs. As such, a network device can be made 
OpenFlow-compliant via firmware upgrade. 


Note that while OpenFlow allows researchers to 
experiment with new network protocols on deployed 
hardware, only a single researcher can use/control an 
OpenFlow-enabled network at a time. As a result, with- 
out FlowVisor, OpenFlow-based research is limited to 
isolated testbeds, limiting its scope and realism. Thus, 
FlowVisor’s ability to slice a production network is an 
orthogonal and independent contribution to OpenFlow-like 
software-defined networks. 
3 FlowVisor Design 


To restate our main goal, FlowVisor aims to use the production 
network as a testbed. In operation, the FlowVisor 
slices the network by slicing each of the network’s corre- 
sponding packet forwarding devices (e.g., switches and 
routers) and links (Figure 3). 


With the FlowVisor, 

• Network resources are sliced in terms of their bandwidth, 
topology, forwarding table entries, and device CPU 
(§3.1). 

• Each slice has control over a set of flows, called its 
flowspace. Users can arbitrarily add (opt-in) and remove 
(opt-out) their own flows from a slice’s flowspace at any 
time (§3.2). 

• Each slice has its own distinct, programmable control 
logic that manages how packets are forwarded and 
rewritten for traffic in the slice’s flowspace. In practice, 
each slice owner implements their slice-specific control 
logic as an OpenFlow controller. The FlowVisor interposes 
between data and control planes by proxying connections 
between OpenFlow switches and each slice controller 
(§3.3). 

• Slices are defined using a slice definition policy language. 
The language specifies the slice’s resource limits, 
flowspace, and controller’s location in terms of IP and 
TCP port-pair (§3.4). 


3.1 Slicing Network Resources 


Slicing a network means correctly slicing all of the cor- 
responding network resources. There are four primary 
slicing dimensions: 


Topology. Each slice has its own view of network nodes 
(e.g., switches and routers) and the connectivity between 
them. In this way, slices can experience simulated net- 
work events such as link failure and forwarding loops. 


Bandwidth. Each slice has its own fraction of bandwidth 
on each link. Failure to isolate bandwidth would allow 
one slice to affect, or even starve, another slice’s through- 
put. 


Device CPU. Each slice is limited in the fraction of 
each device’s CPU that it can consume. Switches 
and routers typically have very limited general purpose 
computational resources. Without proper CPU slicing, 
switches will stop forwarding slow-path packets (§5.3.2), 
drop statistics requests, and, most importantly, will stop 
processing updates to the forwarding table. 


Forwarding Tables. Each slice has a finite quota of forwarding 
rules. Network devices typically support a finite 
number of forwarding rules (e.g., TCAM entries). Failure 
to isolate forwarding entries between slices might allow 
one slice to prevent another from forwarding packets. 

Figure 4: The FlowVisor intercepts OpenFlow messages 
from guest controllers (1) and, using the user’s slicing 
policy (2), transparently rewrites (3) the message to control 
only a slice of the network. Messages from switches 
(4) are forwarded to a guest only if they match its slice 
policy. 
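
A per-slice forwarding-rule quota can be pictured as simple bookkeeping in the slicing layer. The sketch below is only an illustration of the idea under our own assumptions: the class `RuleQuota`, its methods, and the slice names and limits are hypothetical and do not describe FlowVisor's actual implementation.

```python
class RuleQuota:
    """Track how many forwarding rules each slice has installed."""

    def __init__(self, limits: dict):
        self.limits = dict(limits)                # slice name -> max rules
        self.used = {name: 0 for name in limits}  # slice name -> rules in use

    def try_install(self, slice_name: str) -> bool:
        """Account for one new rule; refuse if the slice is at its quota."""
        if self.used[slice_name] >= self.limits[slice_name]:
            return False
        self.used[slice_name] += 1
        return True

    def remove(self, slice_name: str) -> None:
        """Release one rule, for example when a flow entry expires."""
        if self.used[slice_name] > 0:
            self.used[slice_name] -= 1


quota = RuleQuota({"bob_http": 2, "production": 1000})
results = [quota.try_install("bob_http") for _ in range(3)]
print(results)  # [True, True, False]
```

The third request is refused without touching the production slice's allowance, so one slice exhausting its quota cannot stop another from installing rules.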


3.2 Flowspace and Opt-In 


A slice controls a subset of traffic in the network. The 
subset is defined by a collection of packet headers that 
form a well-defined (but not necessarily contiguous) sub- 
space of the entire space of possible packet headers. Ab- 
stractly, if packet headers have n bits, then the set of 
all possible packet headers forms an n-dimensional space. 
An arriving packet is a single point in that space repre- 
senting all packets with the same header. Similar to the 
geometric representation used to describe access control 
lists for packet classification [14], we use this abstrac- 
tion to partition the space into regions (flowspace) and 
map those regions to slices. 
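
One way to picture the mapping from regions to slices is as a list of partially specified headers, where an unspecified field is a wildcard. The sketch below is an assumption-laden simplification: headers are dictionaries rather than bit vectors, rules are evaluated in first-match order (our choice, not a statement about FlowVisor's policy language), and the addresses and slice names are made up.

```python
def in_region(packet: dict, region: dict) -> bool:
    """A packet lies in a region if it agrees on every field the region fixes."""
    return all(packet.get(field) == value for field, value in region.items())


# Hypothetical flowspace: ordered (region, slice) rules.
FLOWSPACE = [
    ({"ip_src": "10.0.0.7", "tcp_dst": 80}, "bob_http"),  # an opted-in user
    ({}, "production"),  # the empty region matches every header
]


def slice_for(packet: dict, flowspace=FLOWSPACE):
    """Return the slice owning this packet, or None if no region matches."""
    for region, slice_name in flowspace:
        if in_region(packet, region):
            return slice_name
    return None


print(slice_for({"ip_src": "10.0.0.7", "tcp_dst": 80}))  # bob_http
print(slice_for({"ip_src": "10.0.0.8", "tcp_dst": 80}))  # production
```

Opting in or out then amounts to adding or removing a region for a given user, which leaves every other user's traffic in the production slice.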

The flowspace abstraction helps us manage users who 
opt-in. To opt-in to a new experiment or service, users 
signal to the network administrator that they would like 
to add a subset of their flows to a slice’s flowspace. Users 
can precisely decide their level of involvement in an ex- 
periment. For example, one user might opt-in all of their 
traffic to a single experiment, while another user might 
just opt-in traffic for one application (e.g., port 80 for 
HTTP), or even just a specific flow (by exactly specifying 
all of the fields of a header). In our prototype the 
opt-in process is manual; but in an ideal system, the user 
would be authenticated and their request checked automatically 
against a policy. 

For the purposes of a testbed we concluded flow-level 
opt-in is adequate—in fact, it seems quite powerful. An- 
other approach might be to opt-in individual packets, 
which would be more onerous. 


3.3 Control Message Slicing 


By design, FlowVisor is a slicing layer interposed be- 
tween data and control planes of each device in the net- 
work. In implementation, FlowVisor acts as a transpar- 
ent proxy between OpenFlow-enabled network devices 
(acting as dumb data planes) and multiple OpenFlow 
slice controllers (acting as programmable control logic— 
Figure 4). All OpenFlow messages between the switch 
and the controller are sent through FlowVisor. FlowVi- 
sor uses the OpenFlow protocol to communicate upwards 
to the slice controllers and downwards to OpenFlow 
switches. Because FlowVisor is transparent, the slice 
controllers require no modification and believe they are 
communicating directly with the switches. 


We illustrate the FlowVisor’s operation by extend- 
ing the example from §2 (Figure 4). Recall that a re- 
searcher, Bob, has created a slice that is an HTTP proxy 
designed to spread all HTTP traffic over a set of web 
servers. While the controller will work on any HTTP 
traffic, Bob’s FlowVisor policy slices the network so 
that he only sees traffic from users that have opted-in 
to his slice. His slice controller doesn’t know the net- 
work has been sliced, so doesn’t realize it only sees a 
subset of the HTTP traffic. The slice controller thinks 
it can control, i.e., insert flow entries for, all HTTP traf- 
fic from any user. When Bob’s controller sends a flow 
entry to the switches (e.g., to redirect HTTP traffic to 
a particular server), FlowVisor intercepts it (Figure 4- 
1), examines Bob’s slice policy (Figure 4-2), and re- 
writes the entry to include only traffic from the allowed 
source (Figure 4-3). Hence the controller is controlling 
only the flows it is allowed to, without knowing that the 
FlowVisor is slicing the network underneath. Similarly, 
messages that are sourced from the switch (e.g., a new 
flow event; Figure 4-4) are only forwarded to guest con-
trollers whose flowspace matches the message. That is, it
will only be forwarded to Bob if the new flow is HTTP 
traffic from a user that has opted-in to his slice. 


Thus, FlowVisor enforces transparency and isolation 
between slices by inspecting, rewriting, and policing 
OpenFlow messages as they pass. Depending on the re- 
source allocation policy, message type, destination, and 
content, the FlowVisor will forward a given message un- 
changed, translate it to a suitable message and forward, 
or “bounce” the message back to its sender in the form 
of an OpenFlow error message. For a message sent 
from slice controller to switch, FlowVisor ensures that 
the message acts only on traffic within the resources as- 
signed to the slice. For a message in the opposite di- 
rection (switch to controller), the FlowVisor examines 
the message content to infer the corresponding slice(s) 
to which the message should be forwarded. Slice con-
trollers only receive messages that are relevant to their
network slice. Thus, from a slice controller’s perspec-
tive, FlowVisor appears as a switch (or a network of
switches); from a switch’s perspective, FlowVisor ap-
pears as a controller.

Figure 5: FlowVisor can trivially recursively slice an already sliced network, creating hierarchies of FlowVisors.

FlowVisor does not require a 1-to-1 mapping between 
FlowVisor instances and physical switches. One FlowVi- 
sor instance can slice multiple physical switches, and 
even re-slice an already sliced network (Figure 5).


3.4 Slice Definition Policy 


The slice policy defines the network resources, flows- 
pace, and OpenFlow slice controller allocated to each 
slice. Each policy is described by a text configuration 
file—one file per slice. In terms of resources, the policy 
defines the fraction of total link bandwidth available to 
this slice (§4.3) and the budget for switch CPU and for-
warding table entries. Network topology is specified as a
list of network nodes and ports. 

The flowspace for each slice is defined by an ordered 
list of tuples similar to firewall rules. Each rule descrip- 
tion has an associated action, e.g., allow, read-only, or 
deny, and is parsed in the specified order, acting on the 
first matching rule. The rules define the flowspace a slice 
controls. Read-only rules allow slices to receive Open- 
Flow control messages and query switch statistics, but 
not to write entries into the forwarding table. Rules are 
allowed to overlap, as described in the example below. 

Let’s take a look at an example set of rules. Alice, the 
network administrator, wants to allow Bob to conduct an 
HTTP load-balancing experiment. Bob has convinced 
some of his colleagues to opt-in to his experiment. Al- 
ice wants to maintain control of all traffic that is not part 
of Bob’s experiment. She wants to passively monitor all 
network performance, to keep an eye on Bob and the pro- 
duction network. 

Here is a set of rules Alice could install in the FlowVi- 
sor: 


Bob’s Experimental Network includes all HTTP traffic 
to/from users who opted into his experiment. Thus, his 
network is described by one rule per user: 
Allow: tcp_port:80 and ip=user_ip.
OpenFlow messages from the switch matching any of 
these rules are forwarded to Bob’s controller. Any flow 
entries that Bob tries to insert are modified to meet these 
rules. 


Alice’s Production Network is the complement of Bob’s 
network. For each user in Bob’s experiment, the produc- 
tion traffic network has a negative rule of the form: 
Deny: tcp_port:80 and ip=user_ip. The
production network would have a final rule that matches 
all flows: Allow: all. 


Thus, only OpenFlow messages that do not go to Bob’s 
network are sent to the production network controller. 
The production controller is allowed to insert forwarding 
entries so long as they do not match Bob’s traffic. 


Alice’s Monitoring Network is allowed to see all traffic 
in all slices. It has one rule, Read-only: all.
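
The three rule sets above, together with first-match evaluation, can be sketched as follows. This is a minimal illustration in Python rather than FlowVisor's own code: the `Rule` class, the dictionary representation of a flow, the slice names, and the choice to deny flows that match no rule are all assumptions made for the example.

```python
from dataclasses import dataclass

ALLOW, READ_ONLY, DENY = "allow", "read-only", "deny"


@dataclass(frozen=True)
class Rule:
    action: str
    match: dict  # header field -> required value; {} matches every flow

    def matches(self, flow: dict) -> bool:
        return all(flow.get(k) == v for k, v in self.match.items())


def first_match(rules, flow):
    """Return the action of the first matching rule.
    Flows that match no rule are denied (an assumption of this sketch)."""
    for rule in rules:
        if rule.matches(flow):
            return rule.action
    return DENY


def build_policy(opted_in_ips):
    """Build Bob's, the production, and the monitoring rule lists."""
    bob = [Rule(ALLOW, {"tcp_port": 80, "ip": ip}) for ip in opted_in_ips]
    production = [Rule(DENY, {"tcp_port": 80, "ip": ip}) for ip in opted_in_ips]
    production.append(Rule(ALLOW, {}))   # final catch-all, "Allow: all"
    monitoring = [Rule(READ_ONLY, {})]   # "Read-only: all"
    return {"bob": bob, "production": production, "monitoring": monitoring}


def receivers(policy, flow):
    """Slices that should see a switch message about `flow`, with their access."""
    result = {}
    for name, rules in policy.items():
        action = first_match(rules, flow)
        if action != DENY:
            result[name] = action
    return result


policy = build_policy(["10.0.0.7"])
print(receivers(policy, {"ip": "10.0.0.7", "tcp_port": 80}))
print(receivers(policy, {"ip": "10.0.0.7", "tcp_port": 22}))
```

With one opted-in user, that user's HTTP flow is delivered to Bob's slice and to the monitoring slice, while the same user's non-HTTP flow is delivered to the production and monitoring slices.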


This rule-based policy, though simple, suffices for the 
experiments and deployment described in this paper. We 
expect that future FlowVisor deployments will have more
specialized policy needs, and that researchers will create 
new resource allocation policies. 


4 FlowVisor Implementation 


We implemented FlowVisor in approximately 8000 lines 
of C and the code is publicly available for download 
from www.openflow.org. The notable parts of the im-
plementation are the transparency and isolation mech-
anisms. Critical to its design, FlowVisor acts as a
transparent slicing layer and enforces isolation between 
slices. In this section, we describe how FlowVisor 
rewrites control messages—both down to the forwarding 
plane and up to the control plane—to ensure both trans- 
parency and strong isolation. Because isolation mech- 
anisms vary by resource, we describe each resource in 
turn: bandwidth, switch CPU, and forwarding table en- 
tries. In our deployment, we found that the switch CPU 
was the most constrained resource, so we devote partic- 
ular care to describing its slicing mechanisms. 


4.1 Messages to Control Plane 


FlowVisor carefully rewrites messages from the Open- 
Flow switch to the slice controller to ensure transparency. 
First, FlowVisor only sends control plane messages to 
a slice controller if the source switch is actually in the 
slice’s topology. Second, FlowVisor rewrites Open- 
Flow feature negotiation messages so that the slice con- 
troller only sees the physical switch ports that appear 
in the slice. Third, OpenFlow port up/port down mes- 
sages are similarly pruned and only forwarded to the af- 
fected slices. Using these message rewriting techniques, 
FlowVisor can easily simulate network events, such as
link and node failures. 
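
The pruning steps above can be sketched as two small filters. This is an illustrative Python sketch, not the FlowVisor implementation; the topology representation (a mapping from switch ID to the set of ports in the slice) and both function names are hypothetical.

```python
def visible_ports(topology, switch_id, physical_ports):
    """Ports of `switch_id` to advertise to a slice controller.

    `topology` maps switch ID -> set of port numbers in the slice.
    Returns None when the switch is not part of the slice at all,
    in which case no message should be forwarded.
    """
    if switch_id not in topology:
        return None
    allowed = topology[switch_id]
    return [port for port in physical_ports if port in allowed]


def slices_to_notify(topologies, switch_id, port):
    """Names of slices that should receive a port up/down event."""
    return sorted(
        name
        for name, topology in topologies.items()
        if port in topology.get(switch_id, set())
    )


topologies = {
    "bob": {"s1": {1, 2}},
    "production": {"s1": {1, 2, 3}, "s2": {1}},
}
print(visible_ports(topologies["bob"], "s1", [1, 2, 3]))
print(slices_to_notify(topologies, "s1", 3))
```

In this example Bob's controller is told about ports 1 and 2 of switch s1 only, and a status change on port 3 is reported to the production slice alone.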


4.2 Messages to Forwarding Plane 


In the opposite direction, FlowVisor also rewrites mes- 
sages from the slice controller to the OpenFlow switch. 
The most important messages to the forwarding plane 
were insertions and deletions to the forwarding table. 
Recall (§2.1) that in OpenFlow, forwarding rules consist 
of a flow rule definition, i.e., a bit pattern, and a set of 
actions. To ensure both transparency and isolation, the 
FlowVisor rewrites both the flow definition and the set of
actions so that they do not violate the slice’s definition. 

Given a forwarding rule modification, the FlowVisor 
rewrites the flow definition to intersect with the slice’s 
flowspace. For example, Bob’s flowspace gives him con- 
trol over HTTP traffic for the set of users—e.g., users 
Doug and Eric—that have opted into his experiment. If 
Bob’s slice controller tried to create a rule that affected 
all of Doug’s traffic (HTTP and non-HTTP), then the 
FlowVisor would rewrite the rule to only affect the in- 
tersection, i.e., only Doug’s HTTP traffic. If the inter- 
section between the desired rule and the slice definition 
is null, e.g., Bob tried to affect traffic outside of his
slice, such as Doug’s non-HTTP traffic, then the FlowVi-
sor would drop the control message and return an error
to Bob’s controller. Because flowspaces are not necessar- 
ily contiguous, the intersection between the desired rule 
and the slice’s flowspace may result in a single rule be- 
ing expanded into multiple rules. For example, if Bob 
tried to affect all traffic in the system in a single rule, the 
FlowVisor would transparently expand the single rule
into two rules: one for each of Doug’s and Eric’s HTTP
traffic. 
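
The intersection and expansion steps can be illustrated with a small sketch. It assumes a simplified model in which a match is a dictionary of exact header values and an absent field is a wildcard; real OpenFlow matches are bit patterns with masks, and the names `intersect`, `rewrite_match`, and `FlowspaceError` are hypothetical.

```python
class FlowspaceError(Exception):
    """Raised when a requested rule lies entirely outside the slice."""


def intersect(a, b):
    """Intersect two wildcard matches; None if no packet can match both."""
    merged = dict(a)
    for field, value in b.items():
        if field in merged and merged[field] != value:
            return None
        merged[field] = value
    return merged


def rewrite_match(requested, flowspace):
    """Rewrite one requested match into the rules allowed by the flowspace.

    Returns one rule per overlapping flowspace entry, so a single
    request may expand into several rules.
    """
    rewritten = []
    for entry in flowspace:
        common = intersect(requested, entry)
        if common is not None and common not in rewritten:
            rewritten.append(common)
    if not rewritten:
        raise FlowspaceError("rule does not overlap the slice's flowspace")
    return rewritten


BOB_FLOWSPACE = [
    {"ip": "doug", "tcp_port": 80},
    {"ip": "eric", "tcp_port": 80},
]

print(rewrite_match({"ip": "doug"}, BOB_FLOWSPACE))  # all of Doug's traffic
print(rewrite_match({}, BOB_FLOWSPACE))              # all traffic in the system
```

The first call narrows the request to Doug's HTTP traffic; the second expands into two rules, one per opted-in user. A request for Doug's non-HTTP traffic raises `FlowspaceError`, which corresponds to the error returned to Bob's controller.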

FlowVisor also rewrites the lists of actions in a for- 
warding rule. For example, if Bob creates a rule to send 
out all ports, the rule is rewritten to send to just the sub- 
set of ports in Bob’s slice. If Bob tries to send out a port 
that is not in his slice, the FlowVisor returns an “action
is invalid” error (recall from above that Bob’s controller
only discovers the ports that do exist in his slice, so only 
in error would he use a port outside his slice). 
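
A sketch of this action rewriting, under the assumption that an output action is represented simply by a list of port numbers or by a marker meaning "all ports". The `ALL_PORTS` constant, the exception, and the function name are hypothetical and stand in for the corresponding OpenFlow constructs.

```python
ALL_PORTS = "ALL"  # stand-in for OpenFlow's "send out all ports" pseudo-port


class InvalidAction(Exception):
    """Models the "action is invalid" error returned to the controller."""


def rewrite_output_ports(requested, slice_ports):
    """Restrict an output action to the ports that belong to the slice."""
    if requested == ALL_PORTS:
        # "All ports" is narrowed to the subset of ports in the slice.
        return sorted(slice_ports)
    outside = [port for port in requested if port not in slice_ports]
    if outside:
        raise InvalidAction(f"action is invalid: ports {outside} not in slice")
    return list(requested)


bob_ports = {1, 2, 5}
print(rewrite_output_ports(ALL_PORTS, bob_ports))
print(rewrite_output_ports([2, 5], bob_ports))
```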


4.3 Bandwidth Isolation 


Typically, even relatively modest commodity network 
hardware has some capability for basic bandwidth iso- 
lation [13]. The most recent versions of OpenFlow ex- 
pose native bandwidth slicing capabilities in the form of 
per-port queues. The FlowVisor creates a per-slice queue 
on each port on the switch. The queue is configured for 
a fraction of link bandwidth, as defined in the slice def- 
inition. To enforce bandwidth isolation, the FlowVisor 


rewrites all slice forwarding table additions from “send 
out port X” to “send out queue Y on port X”, where Y
is a slice-specific queue ID. Thus, all traffic from a given 
slice is mapped to the traffic class specified by the re- 
source allocation policy. While any queuing discipline 
can be used (weighted fair queuing, deficit round robin, 
strict partition, etc.), in implementation, FlowVisor uses 
minimum bandwidth queues. That is, a slice configured 
for X% of bandwidth will receive at least X% and pos-
sibly more if the link is under-utilized. We choose min-
imum bandwidth queues to avoid issues of bandwidth
fragmentation. We evaluate the effectiveness of band- 
width isolation in §5. 
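
The rewrite from “send out port X” to “send out queue Y on port X” can be sketched as a simple transformation over an action list. The tuple encoding of actions and the slice-to-queue mapping below are assumptions made for illustration, not the OpenFlow wire format.

```python
# Hypothetical mapping from slice name to its per-port queue ID.
SLICE_QUEUE_ID = {"production": 1, "bob": 2}


def to_enqueue_actions(actions, slice_name):
    """Replace every ("output", port) with ("enqueue", port, queue_id).

    Other actions are passed through unchanged.
    """
    queue_id = SLICE_QUEUE_ID[slice_name]
    rewritten = []
    for action in actions:
        if action[0] == "output":
            rewritten.append(("enqueue", action[1], queue_id))
        else:
            rewritten.append(action)
    return rewritten


print(to_enqueue_actions([("set_vlan", 10), ("output", 3)], "bob"))
```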


4.4 Device CPU Isolation 


CPUs on commodity network hardware are typically 
low-power embedded processors and are easily over- 
loaded. The problem is that in most hardware, a highly- 
loaded switch CPU will significantly disrupt the network. 
For example, when a CPU becomes overloaded, hard- 
ware forwarding will continue, but the switch will stop 
responding to OpenFlow requests, which causes the for- 
warding tables to enter an inconsistent state where rout- 
ing loops become possible, and the network can quickly 
become unusable. 

Many of the CPU-isolation mechanisms presented are 
not inherent to FlowVisor’s design, but rather a work- 
around to deal with the existing hardware abstraction ex- 
posed by OpenFlow. A better long-term solution would 
be to expose the switch’s existing process scheduling 
and rate-limiting features via the hardware abstraction. 
Some architectures, e.g., the HP ProCurve 5400, already 
use rate-limiters to enforce CPU isolation between Open- 
Flow and non-OpenFlow VLANs. Adding these features 
to OpenFlow is ongoing. 

There are four main sources of load on a switch CPU: 
(1) generating new flow messages, (2) handling requests 
from controller, (3) forwarding “slow path” packets, and 
(4) internal state keeping. Each of these sources of load 
requires a different isolation mechanism. 

New Flow Messages. In OpenFlow, when a packet 
arrives at a switch that does not match an entry in the 
flow table, a new flow message is sent to the controller. 
This process consumes processing resources on a switch 
and if message generation occurs too frequently, the CPU 
resources can be exhausted. To prevent starvation, the 
FlowVisor rate limits the new flow message arrival rate. 
In implementation, the FlowVisor tracks the new flow 
message arrival rate for each slice, and if it exceeds some 
threshold, the FlowVisor inserts a forwarding rule to drop
the offending packets for a short period. 

For example, the FlowVisor keeps a token-bucket style
counter for each flowspace rule (“Bob’s slice gets (1)
all HTTP traffic and (2) all HTTPS traffic”, i.e., two
rules/counters). Each time the FlowVisor receives a
new flow event, the token bucket that matches the flow
gets decremented (for Bob’s slice, packets that match
HTTP count against token bucket #1, packets that match
HTTPS count against #2). Once the bucket is emp-
tied, the FlowVisor inserts a lowest-priority rule into the
switch to drop all packets in that flowspace rule, i.e.,
from the example, if the token bucket corresponding to
HTTPS is emptied, then the FlowVisor will cause the
switch to drop all HTTPS packets without generating
new flow events. The rule is set to expire in 1 second, so
it is effectively a very coarse rate limiter. In practice, if
a slice has control over “all traffic”, this mechanism ef-
fectively blocks all new flow events from saturating the
switch CPU or going to the controller, while allowing all
existing flows to continue without change. We discuss
the effectiveness of this technique in §5.3.2.
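
The per-rule counter described above can be sketched as follows. The text does not say how the buckets are refilled, so this sketch assumes a conventional token bucket with a refill rate and a burst size; the class name, parameters, and returned strings are hypothetical.

```python
class NewFlowLimiter:
    """Coarse limiter for new flow messages, one bucket per flowspace rule."""

    def __init__(self, rate, burst, drop_seconds=1.0):
        self.rate = rate                  # tokens added per second (assumed)
        self.burst = burst                # bucket capacity (assumed)
        self.drop_seconds = drop_seconds  # lifetime of the temporary drop rule
        self._buckets = {}                # rule id -> (tokens, last update time)

    def on_new_flow(self, rule_id, now):
        """Account for one new flow event that matched `rule_id`.

        Returns "forward" while tokens remain, or "install_drop_rule"
        when this event emptied the bucket; the caller would then insert
        a low-priority drop rule that expires after `drop_seconds`.
        """
        tokens, last = self._buckets.get(rule_id, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        tokens = max(0.0, tokens - 1)
        self._buckets[rule_id] = (tokens, now)
        return "install_drop_rule" if tokens == 0 else "forward"


limiter = NewFlowLimiter(rate=2, burst=3)
for t in (0.0, 0.0, 0.0, 1.0):
    print(t, limiter.on_new_flow("bob-https", now=t))
```

Because each flowspace rule has its own bucket, exhausting the HTTPS bucket in this sketch does not affect new flow events that match the HTTP rule.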


Controller Requests. The requests an OpenFlow con- 
troller sends to the switch, e.g., to edit the forwarding 
table or query statistics, consume CPU resources. For 
each slice, the FlowVisor limits CPU consumption by 
throttling the OpenFlow message rate to a maximum rate 
per second. Because the amount of CPU resources con-
sumed varies by message type and by hardware implemen-
tation, it is future work to dynamically infer the cost of
each OpenFlow message for each hardware platform. 


Slow-Path Forwarding. Packets that traverse the 
“slow” path—i.e., not the “fast” dedicated hardware for- 
warding path—consume CPU resources. Thus, an Open- 
Flow rule that forwards packets via the slow path can 
consume arbitrary CPU resources. 


This is because, in implementations, most switches 
only implement a subset of OpenFlow’s functionality 
in their hardware. For example, the ASICs on most 
switches do not support sending one packet out exactly 
two ports (they support unicast and broadcast, but not 
in between). To emulate this behavior, the switches ac- 
tually process these types of flows in their local CPUs, 
i.e., on their slow path. Unfortunately, as mentioned
above, these are embedded CPUs and are not as powerful 
as those on, for example, commodity PCs. 


FlowVisor prevents slice controllers from insert- 
ing slow-path forwarding rules by rewriting them as 
one-time packet forwarding events, i.e., an OpenFlow 
“packet out” message. As a result, the slow-path packets 
are rate limited by the above two isolation mechanisms: 
new flow messages and controller request rate limiting. 


Internal Bookkeeping. All network devices use CPU
to update their internal counters, process events, etc.
So, care must be taken to ensure that there
is sufficient CPU available for the switch’s bookkeep- 
ing. The FlowVisor accounts for this by ensuring that 
the above rate limits are tuned to leave sufficient CPU
resources for the switch’s internal function. 


4.5 Flow Entry Isolation 


The FlowVisor counts the number of flow entries used 
per slice and ensures that each slice does not exceed a 
preset limit. The FlowVisor increments a counter for 
each rule a guest controller inserts into the switch and 
then decrements the counter when a rule expires. Due 
to hardware limitations, certain switches will internally 
expand rules that match multiple input ports, so the 
FlowVisor needs to handle this case specially. When a 
guest controller exceeds its flow entry limit, any new rule
insertions receive a “table full” error message.
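
The counting scheme can be sketched as a per-slice budget. This is an illustration only: the class and exception names are hypothetical, and the `entries` argument is an assumed way of modelling switches that expand one rule into several hardware entries.

```python
class TableFull(Exception):
    """Models the "table full" error returned to a guest controller."""


class FlowEntryBudget:
    """Tracks how many flow entries each slice currently holds."""

    def __init__(self, limits):
        self.limits = dict(limits)             # slice name -> preset limit
        self.used = {name: 0 for name in limits}

    def on_insert(self, slice_name, entries=1):
        """Charge `entries` table entries to the slice, or refuse."""
        if self.used[slice_name] + entries > self.limits[slice_name]:
            raise TableFull(f"slice {slice_name!r} exceeded its flow entry limit")
        self.used[slice_name] += entries

    def on_expire(self, slice_name, entries=1):
        """Return entries to the slice's budget when a rule expires."""
        self.used[slice_name] = max(0, self.used[slice_name] - entries)


budget = FlowEntryBudget({"bob": 2})
budget.on_insert("bob")
budget.on_insert("bob")
budget.on_expire("bob")
budget.on_insert("bob")   # fits again after the expiry
print(budget.used)
```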


5 Evaluation 


To demonstrate the efficiency and robustness of the design,
in this section we evaluate the FlowVisor’s scalability, 
performance, and isolation properties. 


5.1 Scalability 


A single FlowVisor instance scales well enough to serve 
our entire 40+ switch, 7 slice deployment with minimal 
load. As a result, we create an artificially high work- 
load to evaluate our implementation’s scaling limits. The 
FlowVisor’s workload is characterized by the number of 
switches, slices, and flowspace rules per slice as well as 
the rate of new flow messages. We present the results for 
two types of workloads: one that matches what we ob-
serve from our deployment (1 slice, 35 rules per slice, 28
switches, 1.55 new flows per second per switch) and the
other a synthetic workload (10 switches, 100 new flows
per second per switch, 1 slice, 1000 rules per slice) de-
signed to stress the system. In each graph, we fix three
variables according to their workload and vary the fourth.

Our evaluation measured FlowVisor’s CPU utilization 
using a custom script. The script creates a configurable 
number of OpenFlow connections to the FlowVisor, and 
each connection simulates a switch that sends new flow
messages to the FlowVisor at a prescribed rate. With 
each experiment, we configured the FlowVisor’s num- 
ber of slices and flowspace rules per slice. The new 
flow messages were carefully crafted to match only the 
last rule of each slice, causing the worst case behavior 
in the FlowVisor’s linear search of the flowspace rules. 
Each test was run for 5 minutes and we recorded the 
CPU utilization of the FlowVisor process once per sec-
ond, so each result is the average of 300 samples (shown
with one standard deviation). The FlowVisor ran on a
quad-core Intel Xeon 3GHz system running 32-bit De-
bian Linux 5.0 (Lenny).

Figure 6: FlowVisor scales linearly with new flow rate, number of slices, switches, and flowspace rules. We generate
high synthetic workloads to explore the scalability because the workloads observed in our deployment were non-taxing.

Our results with the synthetically high workload show 
that the FlowVisor’s CPU load scales linearly in each of 
these four workload dimensions (as summarized in Fig- 
ure 6). The result is promising, but not surprising. In- 
tuitively, the FlowVisor can process a fixed number of 
OpenFlow messages per second (the product of number 
of switches by new flow rate) and each message must 
be matched against each rule of each slice, so the to- 
tal load is approximately the product of the four work- 
load variables. The synthetic workload with 1,000 new 
flows/s (10 switches by 100 new flows/s) is comparable 
to the peak rate of published real-world enterprise net- 
works [20]: an 8,000 host network generated a peak rate 
of 1,200 new flows per second. Thus, we believe that 
a single FlowVisor instance could manage a large enter- 
prise network. By contrast, our observed workload fluc- 
tuated between 0% and 10% CPU, roughly independent 
of the experimental variable. This validates our belief 
that our deployment can grow significantly using a sin- 
gle FlowVisor instance. 
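
The product-of-four-variables argument can be made concrete with the two workloads from §5.1. The function below is only a back-of-the-envelope model of worst-case rule comparisons per second (its name is hypothetical), not a measurement of CPU load.

```python
def rule_comparisons_per_second(switches, new_flows_per_switch, slices, rules_per_slice):
    """Worst-case rule comparisons per second under a linear search,
    assuming every message is checked against every rule of every slice."""
    return switches * new_flows_per_switch * slices * rules_per_slice


observed = rule_comparisons_per_second(28, 1.55, 1, 35)
synthetic = rule_comparisons_per_second(10, 100, 1, 1000)
print(round(observed), synthetic)
```

Under this model the observed deployment requires roughly 1,500 comparisons per second and the synthetic workload 1,000,000, a gap of nearly three orders of magnitude, which is consistent with the deployment leaving the FlowVisor lightly loaded.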

While our results show that FlowVisor scales well be- 
yond our current requirements and workload, it is worth 
noting that it is possible to achieve even further scaling 
by moving to a multi-threaded implementation (the cur- 
rent implementation is single threaded) or even to multi- 
ple FlowVisor instances. 


5.2 Performance Overhead 


Adding an additional layer between control and data 
planes adds overhead to the system. However, as a re- 
sult of our design, the FlowVisor does not add over- 
head to the data plane. That is, with FlowVisor, packets 
are forwarded at full line rate. Nor does the FlowVisor 
add overhead to the control plane: control-level calcula- 
tions like route selection proceed at their un-sliced rate. 
FlowVisor only adds overhead to actions that cross be-
tween the control and data plane layers.

To quantify this cross-layer overhead, we measure the 
increased response time for slice controller requests with 
and without the FlowVisor. Specifically, we consider 
the response time of the OpenFlow messages most com- 
monly used in our network and by our monitoring soft- 
ware: the new flow and the port status request messages. 

In OpenFlow, a switch sends a new flow message to 
its controller when an arriving packet does not match any 
existing forwarding rules. We examine the increased de- 
lay of the new flow message to better understand how 
the FlowVisor affects connection setup latency. In our 
experiment, we connect a machine with two interfaces to 
a switch. One interface sends a packet every 20ms (50 
packets per second) to the switch and the other interface 
is the OpenFlow control channel. We measure the time 
between sending the packet and receiving the new flow 
message using libpcap. Our results (Figure 7(a)) show 
that the FlowVisor increases time from the switch to con-
troller by an average of 16ms. For latency-sensitive ap-
plications, e.g., web services in large data centers, 16ms
may be too much overhead. However, new flow mes-
sages add 12ms latency on average even without FlowVi-
sor, so we believe that slice controllers in those envi-
ronments will likely proactively insert flow entries into
switches, avoiding this latency altogether. We point out
that the algorithm FlowVisor uses to process new flow
messages is naive, and its run-time grows linearly with
the number of flowspace rules (§5.1). We have yet to ex-
periment with the many classification algorithms that can
be expected to improve the lookup speed.

A port status request is a message sent by the con- 
troller to the switch to query the byte and packet coun- 
ters for a specific port. The switch returns the counters 
in a corresponding port status reply message. We choose 
to study the port status request because we believe it to 
be a worst case for FlowVisor overhead. The message 
is very cheap to process at the switch and controller, but 
expensive for the FlowVisor to process: it has to edit the
message per slice to remove statistics for ports that do
not appear in a sliced topology.

Figure 7: CDF of slicing overhead for OpenFlow new flow messages and port status requests.

We wrote a special-purpose controller that sent ap- 
proximately 200 port status requests per second and mea- 
sured the response times. The rate was chosen to ap- 
proximate the maximum request rate supported by the 
hardware. The controller, switch, and FlowVisor were 
all on the same local area network, but controller and 
FlowVisor were hosted on separate PCs. Obviously, the 
overhead can be increased by moving the FlowVisor ar- 
bitrarily far away from the controller, but we design this 
experiment to quantify the FlowVisor’s processing over- 
head. Our results show that adding the FlowVisor causes 
an average overhead for port status responses of 0.48 mil-
liseconds (Figure 7(b)). We believe that port status re-
sponse time being faster than new flow processing time
is not inherent, but simply a matter of better optimization 
for port status request handling. 


5.3 Isolation 
5.3.1 Bandwidth 


To validate the FlowVisor’s bandwidth isolation prop- 
erties, we run an experiment where two slices compete 
for bandwidth on a shared link. We consider the worst 
case for bandwidth isolation: the first slice sends TCP- 
friendly traffic and the other slice sends TCP-unfriendly 
constant-bit-rate (CBR) traffic at full link speed (1Gbps). 
We believe these traffic patterns are representative of a 
scenario where a production slice (TCP) shares a link with,
for example, a slice running a DDoS experiment (CBR). 

This experiment uses 3 machines—two sources and a 
common sink—all connected via the same HP ProCurve 
5400 switch, i.e., the switch found in our wiring closet. 
The traffic is generated by iperf in TCP mode for the 
TCP traffic and UDP mode at 1Gbps for the CBR traffic. 
We repeat the experiment twice: with and without the 
FlowVisor’s bandwidth isolation features enabled (Fig-
ure 8(a)). With the bandwidth isolation disabled (“with-
out Slicing”), the CBR traffic consumes nearly all the
bandwidth and the TCP traffic averages 1.2% of the link
bandwidth. With the traffic isolation features enabled
(“with 30/70% reservation”), the FlowVisor maps the
TCP slice to a QoS class that guarantees at least 70%
of link bandwidth and maps the CBR slice to a class that
guarantees at least 30%. Note that these are minimum
bandwidth guarantees, not maximum. With the band-
width isolation features enabled, the TCP slice achieves
an average of 64.2% of the total bandwidth and the CBR
an average of 28.5%. Note the event at 20 seconds
where the CBR with QoS jumps and the TCP with QoS
experiences a corresponding dip. We believe this to be
the result of a TCP congestion event that allowed the 
CBR traffic to temporarily take advantage of additional 
available bandwidth, exactly as the minimum bandwidth 
queue is designed. 


5.3.2 Switch CPU 


To quantify our ability to isolate the switch CPU re-
source, we show two experiments that monitor CPU
usage over time of a switch with and without isolation
enabled. In the first experiment (Figure 8(b)), the Open-
Flow controller maliciously sends port stats request mes-
sages (as above) at increasing speeds (2, 4, 8, ..., 1024
requests per second). In our second experiment (Fig-
ure 8(c)), the switch generates new flow messages faster
than its CPU can handle and a faulty controller does not
add a new rule to match them. In both experiments, we
show the switch’s CPU utilization averaged over one sec-
ond, and the FlowVisor’s isolation features reduce the
switch utilization from 100% to a configurable amount.
In the first experiment, we note that the switch could han-
dle fewer than 256 port status requests per second without
appreciable CPU load, but immediately goes to 100% load
when the request rate hits 256 requests per second. In the
second experiment, the bursts of CPU activity in Figure 8(c)
are a direct result of using null forwarding rules (§4.4) to
rate limit incoming new flow messages. We expect that
future versions of OpenFlow will better expose the hard-
ware CPU limiting features already in switches today.

Figure 8: FlowVisor’s bandwidth isolation prevents CBR traffic from starving TCP, and message throttling and new
flow message rate limiting prevent CPU starvation.


6 Deployment Experience 


To provide evidence that sliced experimental traffic can 
indeed co-exist with production traffic, we deployed 
FlowVisor on our production network. By “production”, 
we refer to the network that the authors rely on to read 
their daily email, surf the web, etc. Additionally, six 
other campuses are currently using the FlowVisor as part 
of the GENI “meso-scale” infrastructure. In this section, 
we describe our experiences in deploying FlowVisor in 
our production network, its deployment in other cam- 
puses, and briefly describe the experiments that have run 
on the FlowVisor. 


6.1 Stanford Deployment 


At Stanford University, we have been running FlowVi- 
sor continuously on our production network since June 
4th, 2009. Our network consists of 25+ users, 5 NEC 
IP8800 switches, 2 HP ProCurve 5400s, 30 wireless ac- 
cess points, 5 NetFPGA [15] cards acting as OpenFlow 
switches, and a WiMAX base station. Our physical 
network is effectively doubly sliced: first by VLANs 
and then by the FlowVisor. Our network trunks over 
10 VLANs, including traffic for other research groups,
but only three of those VLANs are OpenFlow-enabled. 
Of the three OpenFlow VLANs, two are sliced by 
FlowVisor. We maintain multiple OpenFlow VLANs 
and FlowVisor instances to allow FlowVisor develop- 
ment without impacting production traffic. 

For each FlowVisor-sliced VLAN, all network de- 
vices point to a single FlowVisor instance, running on 
a 3.0GHz quad-core Intel Xeon with 2 GB of DRAM. 
For maximum uptime, we ran FlowVisor from a wrap- 
per script that instantly restarts it if it should crash. The 
FlowVisor was able to handle restarts seamlessly because 
it does not maintain any hard state in the network. In 
our production slice, we ran NOX’s routing module to 
perform basic forwarding in the network. We will pub- 
lish our slicing administration tools and debugging tech- 
niques. 


6.2 Deploying on Other Networks


As part of the GENI “meso-scale” project, we also de-
ployed FlowVisor onto test networks on six university
campuses: University of Washington, Wis-
consin University, Princeton University, Indiana Univer-
sity, Clemson University, and Rutgers University. In each
network, we have a staged deployment plan with the 
eventual goal of extending the existing OpenFlow and 
FlowVisor test network to their production networks. 
Each network runs its own FlowVisor. Recently, at the 
8th GENI Engineering Conference (GEC), we demon-
strated how slices at each campus’s network could be
combined with tunnels to create a single wide-area net- 
work slice. Currently, we are in the process of extending 
the FlowVisor deployment into two backbone networks 
(Internet2 and National Lambda Rail), with the eventual 
goal of creating a large-scale end-to-end sliceable wide 
area network. 


6.3 Slicing Experience 


In our experience, the two largest causes of network 
instability were unexpected interactions with other de- 
ployed network devices and device CPU exhaustion. 
One problem we had was interacting with a virtual IP 
feature of the router in our building. This feature allows 
multiple physical interfaces to act as a single, logical in- 
terface for redundancy. In implementation, the router 
would reply to ARP requests with the MAC address of 
the logical interface but source packets from any of three 
different MAC addresses corresponding to the physical 
interfaces. As a result, we had to revise the flowspace as- 
signed to the production slice to include all four MAC ad- 
dresses. Another aspect that we did not anticipate is the 
amount of broadcast traffic emitted from non-OpenFlow 
devices. It is quite common for a device to periodi- 
cally send broadcast LLDP, Spanning Tree, and other 
packets. The level of broadcast traffic on the network 
made debugging more difficult and could cause loops if 
our OpenFlow-based loop detection/spanning tree algo-
rithms did not match the non-OpenFlow-based spanning
tree.

Another issue was the interaction between OpenFlow 
and device CPU usage. As discussed earlier (§4.4), the
most frequent form of slice isolation violations occurred 
with device CPU. The main form of isolation violation 
occurred when one slice would insert a forwarding rule 
that could only be handled via the switch’s slow path and, 
as a result, would push the CPU utilization to 100%, pre- 
venting slices from updating their forwarding rules. We 
also found that the cost to process an OpenFlow message
varied significantly by type and by OpenFlow implemen-
tation, particularly with stats requests, e.g., the OpenFlow
aggregate stats command consumed more CPU than an
OpenFlow port stats command, but not on all implemen-
tations. As part of our future work, we plan to assign
a per-message-type cost to each OpenFlow request to
more precisely slice device CPU. Additionally, the up-
coming OpenFlow version 1.1 will add support for rate
limiting messages coming from the fast to slow paths. 


6.4 Experiments 


We’ve demonstrated that FlowVisor supports a wide va- 
riety of network experiments. On our production net- 
work, we ran four networking experiments, each in its 
own slice. All four experiments, including a network 
load-balancer [12], wireless streaming video [26], traffic 
engineering, and a hardware prototyping experiment [9], 
were built on top of NOX [11]. As part of the 7th GENI
Engineering Conference, each of the seven campuses
demonstrated their own, locally designed experiments, 
running in a FlowVisor-enabled slice of the network. 
Our hope is that the FlowVisor will continue to allow 
researchers to run novel experiments in their own net- 
works. 


7 Related Work 


There is a vast array of work related to network exper- 
imentation in both controlled and operational environ- 
ments. Here we scratch the surface by discussing some 
of the more recent highlights. 

The community has benefited from a number of 
testbeds for performing large-scale experiments. The 
two most widely used are PlanetLab [21] and Emu- 
lab [25]. PlanetLab’s primary function has been that of 
an overlay testbed, hosting software services on nodes 
deployed around the globe. Emulab is targeted more 
at localized and controlled experiments run from arbi- 
trary switch-level topologies connected by PCs. Shad- 
owNet [3] exposes virtualization features of specific 
high-end routers, but does not provide per-flow forward- 
ing control or user opt-in. VINI [1], a testbed closely 
affiliated with PlanetLab, further provides the ability for
multiple researchers to construct arbitrary topologies of 
software routers while sharing the same physical infrastructure.
Similarly, software virtual routers offer both
programmability and reconfigurability, and have been shown
to manage impressive throughput on commodity hard- 
ware (e.g. [5]). 

In the spirit of these and other testbed technologies, 
FlowVisor is designed to aid research by allowing mul- 
tiple projects to operate simultaneously, and in isolation, 
in realistic network environments. What distinguishes 
our approach is that we slice the hardware forwarding 
paths of unmodified commercial network gear. 

Supercharged PlanetLab [23] is a network experimen- 
tation platform designed around CPUs and NPUs (net- 
work processors). NPUs can provide high performance 
and isolation while allowing for sophisticated per-packet 
processing. In contrast, our work forgoes the ability 
to perform arbitrary per-packet computation in order to 
work on unmodified hardware. 

VLANs [4] are widely used for segmentation and iso- 
lation in networks. VLANs slice Ethernet L2 broadcast 
domains by decoupling virtual links from physical ports. 
This allows multiple virtual links to be multiplexed over 
a single virtual port (trunk mode), and it allows a sin- 
gle switch to be segmented into multiple, L2 broadcast 
networks. VLANs use a specific control logic (L2 for- 
warding and learning over a spanning tree). FlowVisor, 
on the other hand, allows users to define their own con- 
trol logic. It also supports a more flexible method for 
defining the traffic that is in a slice, and the way users 
opt in. For example, with FlowVisor a user could opt-in 
to two different slices, whereas with VLANs their traffic 
would all be allocated to a single slice at Layer 2. 

Perhaps the most similar to FlowVisor is the Prospero 
ATM Switch Divider Controller [24]. Prospero uses a 
hardware abstraction interface, Ariel, to allow multiple 
control planes to operate on the same data plane. While 
architecturally similar to our design, Prospero slices in- 
dividual ATM switches where FlowVisor has a central- 
ized view and can thus create a slice of the entire net- 
work. Further, Ariel provides the ability to match on 
ATM-related fields (e.g., VCI/VPI) where OpenFlow can 
match on any combination of 12 fields spanning layers
one through four. This additional capability is critical 
for our notion of flow-level opt-in. 


8 Trade-offs and Caveats 


The FlowVisor approach is extremely general: it simply
states that if we can insert a slicing layer between the 
control and data planes of switches and routers, then we 
can perform experiments in the production network. In 
principle, the experimenter can exploit any capability of
the data plane, so long as it is made available to them. 

Our prototype of FlowVisor is based on OpenFlow, 
which makes very few of the hardware capabilities 
available—which limits the flexibility. Most switch and 
router hardware can do a lot more than is exposed via 
OpenFlow (e.g. dozens of different packet scheduling 
policies, encapsulation into VLANs, VPNs, GRE tun- 
nels, MPLS, and so on). OpenFlow makes a trade-off: 
it only exposes a lowest common denominator that is 
present in all switches in return for a common vendor- 
agnostic interface. So far, this minimal set has met the 
needs of early experimenters—there appears to be a ba- 
sic set of primitive “plumbing” actions that are suffi- 
cient for a wide array of experiments, and over time we 
would expect the OpenFlow specification to evolve to be 
“just enough”, like the RISC instruction set in CPUs. In 
addition to the diverse set of experiments we have cre- 
ated, others have created experiments for data center net- 
work schemes (such as VL2 and Portland), new routing 
schemes, home network managers, mobility managers, 
and so on. 

However, there will always be experimenters who 
need more control over individual packets. They might 
want to use features of the hardware not exposed by 
OpenFlow; or they might want full programmatic con- 
trol, not available in any commercial hardware. The first 
case is a little easier to handle, because a switch or router 
manufacturer can expose more features to the experi- 
menter if they choose, either by vendor-specific exten- 
sions to OpenFlow and FlowVisor, or by allowing flows 
to be sent to a logical internal port that, in turn, processes 
the packets in a pre-defined box-specific way.

But if an experiment needs a way to modify packets 
arbitrarily, the researcher needs a different box. If the 
experiment calls for arbitrary processing in novel ways at 
every switch in the network, then OpenFlow is probably 
not the right interface, and our prototype is unlikely to 
be of much use. If the experiment only needs processing 
at some parts of the network (e.g. to do deep packet in- 
spection, or payload processing) then the researcher can 
route their flows through some number of special middle- 
boxes or way-points. The middle-boxes could be conven- 
tional servers, NPUs [23], programmable hardware [15], 
or custom hardware. The good thing is that these boxes 
can be placed anywhere, and the flows belonging to a 
slice can be routed through them - including all the flows 
from users who opt in. In the end, the value of FlowVisor
to the researcher will depend on how many middle-boxes 
the experiment needs to be realistic—just a few and it 
may be worth it; if it needs hundreds or thousands then 
FlowVisor is providing very little value. 


For example, this is how some OpenFlow switches implement 
VPN tunnels today. 


A second limitation of our prototype is the ability to 
create arbitrary topologies. If a physical switch is to ap- 
pear multiple times in a slice’s topology (i.e. to create a 
virtual topology larger than the physical topology), there 
is currently no standardized way to do this. The hardware 
needs to allow packets to loop back, and pass through the 
switch multiple times. In fact, most—but not all—switch 
hardware allows this. At some later date we expect this 
will be exposed via OpenFlow, but in the meantime it 
remains a limitation. 


9 Conclusion 


Put bluntly, the problem with testbeds is that they are 
testbeds. If we could test new ideas at scale, with real 
users, traffic and topologies, without building a testbed, 
then life would be much simpler. Clearly this isn’t the 
case today: testbeds need to be built, maintained, and 
are expensive to deploy at scale. They become obsolete 
quickly, and many university machine rooms have out- 
dated testbed equipment lying around unused. 

By definition, a testbed is not the real network: there- 
fore, we try to embed testbeds into the network by slicing 
the hardware. This paper described our first attempt to- 
wards embedding a testbed in the network. While not 
yet bullet-proof, we believe that our approach of slicing 
the communication between the control and data planes 
shows promise. Our current implementation is limited to 
controlling the abstraction of the forwarding element ex- 
posed by OpenFlow. We believe that exposing more fine- 
grained control of the forwarding elements will allow 
us to solve the remaining isolation issues (e.g., device 
CPU), ideally with the help of the broader community. If
we can perfect isolation, then several good things hap- 
pen: researchers could validate their ideas at scale and 
with greater realism, the industry could perform safer 
quality assurance of new products, and finally, network 
operators could run multiple versions of the networks in 
parallel, allowing them to roll back to known good states. 
Building Extensible Networks with Rule-Based Forwarding


Lucian Popa* Norbert Egi* 


Abstract 


We present a network design that provides flexible 
and policy-compliant forwarding. Our proposal centers 
around a new architectural concept: that of packet rules. 
A rule is a simple if-then-else construct that describes
the manner in which the network should — or should not 
— forward packets. A packet identifies the rule by which 
it is to be forwarded and routers forward each packet 
in accordance with its associated rule. Each packet rule 
is certified, guaranteeing that all parties involved in 
forwarding a packet agree with the packet’s rule. Packets 
containing uncertified rules are simply dropped in the 
network. We present the design, implementation and 
evaluation of a Rule-Based Forwarding (RBF) archi- 
tecture. We demonstrate flexibility by illustrating how 
RBF supports a variety of use cases including content 
caching, middlebox selection and DDoS protection. 
Using our prototype router implementation we show that 
the overhead RBF imposes is within the capabilities of 
modern network equipment. 


1 Introduction 


A central component of a network design is its forward- 
ing architecture that determines the manner in which 
packets are transported between two endpoints. Today’s 
Internet offers users a simple forwarding model: a user 
hands the network a packet with a destination address 
and the network makes a best-effort attempt to deliver 
the packet to the destination. Although simple, this archi- 
tecture is also fairly limited and there have been repeated 
calls to extend the Internet’s forwarding architecture for 
greater flexibility—allowing, for example, the user to se- 
lect the path his packets should traverse [20, 44, 47, 49] 
or to specify whether packets can/should be processed 
by middleboxes and active routers [47, 49, 29, 48, 25]. 

Achieving a flexible forwarding architecture has thus 
been a long-held, if elusive, goal of Internet research 
[47, 49, 29, 20, 48, 25, 40]. Our work in this paper shares 
this goal. Our point of divergence from prior efforts 
starts with the observation that forwarding flexibility is 
inherently coupled with issues of policy. 

Our thesis is that achieving flexibility is not just a 
matter of augmenting packets with more expressive
forwarding directives that routers execute. Rather, in ad- 
dition, for each forwarding directive that enhances flex- 
ibility, the parties involved in forwarding should be able 
to set policies that constrain that directive. By the policy 
of entity A (host, middlebox operator or ISP) we refer to 
the decision whether to approve or reject a forwarding 
directive based on A’s business or technical goals. By 
forwarding directive we refer to instructions provided by 
endpoints to routers and middleboxes on how to forward 
their packets. For example, a forwarding directive could 
specify that sender S can forward its packets through 
middlebox M before reaching destination D. An example 
of policy would be M refusing to accept packets from S. 


To better illustrate our thesis, consider its application to 
the Internet. Since the main forwarding directive in IP is 
for sender S to send packets to destination D, D should be
able to specify whether the traffic from S may reach it,
i.e., by explicitly allowing or denying packets from
S. Unfortunately, IP does not provide such functionality,
effectively leaving the end-hosts vulnerable to DoS 
attacks. Unsurprisingly, this lack of functionality has 
been identified as one of the main security vulnerabilities 
of the Internet, and several solutions have been proposed 
to address this limitation [51, 52, 21, 32, 37, 22]. 

Of course, forwarding directives and policies are only 
as good as the ability of the network to enforce them and 
to guarantee their authenticity. What complicates policy 
enforcement is the involvement of multiple parties in 
achieving the packet’s flexible behavior—the network 
service providers along the path, potential middlebox 
operators and, of course, the source and destination. As 
such, the network must ensure that a packet’s forwarding 
directive complies with the policies of all parties in- 
volved. In our previous middlebox example, the network 
must ensure that M is willing to relay packets from S 
to D. If M does not approve, the network should simply 
drop the packets before reaching M. 


In this paper, we propose a new rule-based forwarding 
architecture, RBF, that treats flexibility and policy 
enforcement as equal design goals. RBF is based on a 
new architectural concept — that of packet rules. In RBF, 
instead of sending packets to a destination (IP) address, 
end-hosts send packets to a rule. Rules are created by 
destinations. A sender fetches the destination’s rule from 
a DNS-like infrastructure and inserts it in the packets 
sent to that destination. 
A rule is a simple if-then-else construct that
describes the manner in which the network should — or 
should not — forward packets. For example, a destination 
A can receive packets only from source S using the rule: 

R_A: if (pkt.source != S) drop pkt


Or a mobile client B might route certain video content 
through a 3rd-party transcoding proxy with: 
R_B: if (pkt.URL = hulu.com) sendPktTo trnscdPrxy


The above examples are anecdotal (we present precise 
syntax and additional examples in §3) but serve to 
illustrate how destinations can control and customize 
how the network forwards their packets in a manner not 
easily accommodated by current IP. In effect, with rules, 
a receiving host must specify both which packets it is 
willing to receive as well as how it wants these packets 
forwarded and processed by the network. 

The rule-based architecture we develop offers the 
following properties: 


Rules are mandatory: routers drop packets without 
rules 


Rules are provably authorized: all recipients (end- 
hosts, middleboxes and/or routers) named in the rule 
must explicitly agree to receive the associated packet(s). 
Routers, middleboxes and end-users can verify a rule’s 
authorization. 


Rules are provably safe: rules cannot exhaust net- 
work resources; e.g., rules cannot compromise or corrupt 
routers nor cause forwarding loops. 


Rules allow flexible forwarding: rules are a (con- 
strained) program that allows a user to “customize” how 
the network forwards its packets. 

The first two properties assist in policy enforcement by 
ensuring a packet is only forwarded if explicitly cleared 
by all recipients (i.e., if it conforms with the policies of 
all recipients) specified in the rule. Since RBF defines 
policies on rules, any recipient will have the ultimate say 
on whether to accept any rule that contains forwarding 
directives sending packets to it. Since all forwarding 
directives are encoded into rules, we achieve our goal of 
enabling any entity affected by a forwarding directive to 
constrain that directive. 

The third property ensures rules cannot be (mis)used to 
attack the network itself. As we shall show, the last prop- 
erty provides flexibility since users can give the network 
fine-grained instructions on how to handle their packets, 
enabling: explicit use of in-network functionality at 
middleboxes and routers, loose path forwarding, multi- 
path forwarding, anycast, multicast, mobility, filtering of 
undesired senders/ports/protocols, recording of on-path 
information, etc. In the remainder of this paper, we 
present the design, implementation and evaluation of a 
forwarding architecture that meets the above properties. 
RBF relates to an extensive body of work on both for-
warding flexibility and policy enforcement. We discuss 
related work in detail later in this paper and here only 
note that, at a high level, we believe what distinguishes 
RBF is its focus on simultaneously supporting flexibility 
and the multi-party policy requirements that such 
flexibility implies. As we shall see, this goal leads us to 
a design that differs significantly from prior proposals. 

Finally, we note up-front that RBF is more complex 
than the existing IP forwarding architecture, which is 
frequently cited for its simplicity. In addition, RBF 
relies on strong assumptions such as anti-spoofing, the 
existence of rule-certifying authorities and a DNS-like 
infrastructure to distribute rules. The gain, relative to 
today’s IP forwarding, is significantly improved flexibil- 
ity and security; we posit that the greater complexity of 
our solution is a perhaps inevitable consequence of this 
richer service model. 


2 Design Rationale and Overview 


We start with the goal of network flexibility and allowing 
users control over how the network processes their pack- 
ets. The abstraction that perhaps best supports flexibility 
is simply that of a program, leading to an architecture 
where users write packet-processing programs that 
routers execute. This vision of code-carrying packets is, 
of course, the cornerstone of active networking [48, 50] 
and we borrow this as our starting point in designing 
RBF. However, as we shall see, RBF severely dials 
back on the full-fledged generality of the original active 
networks’ vision to arrive at a significantly simpler and 
safer architecture. 

Rules are thus a form of program. The challenge then is 
to appropriately constrain these programs/rules to ensure 
that they cannot harm the network or other hosts. The key 
insight behind RBF is that these constraints must extend 
along two dimensions. First, rules must be safe, i.e., guar- 
anteed not to corrupt or exhaust network resources. In 
addition, however, we must constrain rules to respect 
the policies of all stakeholders involved: source, destination,
middleboxes and ISPs. This latter requirement is
unique and yet critical to networking contexts but was 
underappreciated in early active networking proposals.

To address policy safety, RBF incorporates two key 
design decisions: 

(D1) Layering: we believe network operators will be 
unwilling to relinquish control of route discovery and 
computation and hence we layer RBF above current IP 
forwarding and do not allow rules to modify the IP-layer 
forwarding information base (FIB). 

(D2) Verifiable stakeholder agreement: we require 
that a rule be authorized by all entities it explicitly 
names (e.g., destination, middleboxes or routers). This 
ensures agreement of the stakeholders’ policies with 
the rule’s intent; in particular it also ensures that rules
cannot violate ISPs’ routing policies, since providers 
must explicitly agree to have their routers named in 
rules. To achieve this property, in RBF rules are certified 
by trusted third parties, which in turn gather proofs of 
policy compliance from each of the rule participants. 

To address rule safety, we impose strict constraints
on rule syntax, such that safety can be verified through
simple static analysis:

(C1) Rules cannot directly modify router state. This 
avoids corruption of router state. However, this can be 
a limiting restriction, particularly to network operators 
who wish to expose in-network services such as caching 
or monitoring to end users. To accommodate this, RBF 
allows operators to deploy specialized packet-processing 
functions at their routers and allows rules to invoke these 
functions. Such “router-defined functions” do allow rules 
to update router state, but only indirectly via code in- 
stalled, and hence presumably trusted, by operators. This 
model for router-defined functions thus represents a mid- 
dle ground in the tradeoff between flexibility and safety. 

(C2) The rule “instruction set” is limited to only four
possible action statements: (a) forward the packet to 
the underlying IP layer, (b) invoke a router-defined 
function, (c) modify the packet header and (d) drop 
the packet, plus conditionals that determine whether an 
action should be taken based on reading packet headers 
and router state. Note that there is no action that allows 
backward jumps across rule statements. This prevents 
looping or resource exhaustion at routers and ensures 
execution time is linear in program size. 
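
The static check implied by (C1) and (C2) can be illustrated with a short sketch. The representation below is an assumption made for illustration and is not the RBF prototype's encoding: the class names (`Sendto`, `Invoke`, `SetAttr`, `Drop`, `If`) and the function `check_safe` are hypothetical. Because a rule is a tree of nested conditionals with no jump construct, a single pass over the tree suffices, and the returned statement count bounds the number of steps a router executes.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Sendto:
    dest: str


@dataclass(frozen=True)
class Invoke:
    function: str        # name of a router-defined function


@dataclass(frozen=True)
class SetAttr:
    target: str          # e.g. "packet.visitedR1"
    value: object


@dataclass(frozen=True)
class Drop:
    pass


@dataclass(frozen=True)
class If:
    condition: str       # kept opaque here: a comparison on attributes
    then: Tuple["Stmt", ...]
    orelse: Tuple["Stmt", ...] = ()


Stmt = Union[Sendto, Invoke, SetAttr, Drop, If]


def check_safe(stmts) -> int:
    """Return the statement count of a safe rule; raise ValueError otherwise."""
    count = 0
    for s in stmts:
        count += 1
        if isinstance(s, If):
            # Branches are nested statements; there is no way to jump back.
            count += check_safe(s.then) + check_safe(s.orelse)
        elif isinstance(s, SetAttr):
            # (C1): rules may update packet attributes, never router state.
            if not s.target.startswith("packet."):
                raise ValueError(f"rule may not modify {s.target}")
        elif not isinstance(s, (Sendto, Invoke, Drop)):
            # (C2): only the four action types are permitted.
            raise ValueError(f"unknown statement {s!r}")
    return count
```

A rule that attempts `SetAttr("router.fib", ...)` is rejected before it is ever certified, whereas a waypoint-style rule built from nested `If` statements passes and yields its size as an upper bound on execution cost.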

The above constraints represent a stark departure from 
the rich generality of the active networks vision. Indeed, 
rules are more a sequence of packet steering directives, 
rather than a full-fledged program. The benefit is ver- 
ifiable rule and policy safety. Moreover we find that, 
despite these constraints, rules suffice to express a wide 
variety of forwarding behaviors as we will later illustrate. 


2.1 Architecture Overview and Assumptions 


We now provide a brief overview of the main components
and assumptions of an RBF architecture. Figure 1
illustrates the forwarding architecture of an RBF-enabled 
router. On receiving a packet, the router hands it to the 
rule forwarding engine, which processes the packet’s 
rule. Such processing may involve reading router state 
that the network operator has opted to expose; we term 
such state router attributes. Based on information in the 
packet header (packet attributes) and router attributes, 
the rule forwarding engine may update the packet’s 
attributes (including its destination), invoke router 
functions, drop the packet and/or hand the packet to the 
underlying IP forwarding engine. Recall that for safety 
reasons the rule is not allowed to update router state. 
Figure 1: RBF router and rule forwarding


The design of a rule-based architecture involves the 
design of rules themselves as well as the surrounding 
infrastructure required to support the distribution, 
processing and securing of rules. Consequently, the RBF 
architecture consists of four main components: 


• The specification of packet rules: their syntax, packet
  encoding, constraints on what rules can and cannot do.

• Certificate authorities called Rule Certification Entities
  (RCEs) that certify rules after checking that they
  are well formed, and that every destination specified
  in the rule agrees with (i.e., has signed) the rule.

• Modified IP routers that verify rule certificates and
  process packets as described above.

• A modified DNS infrastructure that either directly
  resolves a host D’s domain name to D’s rule, or
  resolves D’s domain name to another rule resolution
  server which in turn provides D’s rule.


Assumptions: RBF builds on three major assumptions. 

First, RBF assumes the existence of an anti-spoofing 
mechanism. This is required because rules may use 
source and destination IP addresses in their decision pro- 
cess and hence addresses must be legitimate, otherwise 
policy compliance cannot be enforced.¹ In this paper we
assume the use of ingress filtering, although RBF can
accommodate alternate solutions, e.g., Passports [36].
The rationale behind our choice of ingress filtering is
described in §4.

¹ Note that any solution for blocking undesired traffic inside
the network requires a way to identify sources. Anti-spoofing
identifies users based on their addresses. An alternative is to
identify users by their access path [51, 52], but this approach
ties communications to a specific path, restricting flexibility
(e.g., for mobility, traffic engineering, multi-path forwarding).

Second, we assume routers know the public keys of
RCEs and can thus verify rule certificates. We assume
the number of RCE organizations is relatively small and
these keys can be statically configured at routers, akin to
how browsers today are configured with the list of major
certificate authorities.² Note that although we assume a
small number of RCE organizations, we envisage each 
organization will run geographically replicated instances 
of their service for improved scalability and robustness. 
Finally, we assume that the rule resolution infras- 
tructure (whether DNS or the resolution servers the 
DNS points to) is well provisioned, akin to how major 
Internet services (Google, DNS, Amazon) operate today, 
relying on engineering approaches such as maintaining 
a presence at major ISPs, IP anycasting, bandwidth pro- 
visioning, and so forth. As described in §4, we make this 
assumption to protect against “denial of rule” attacks. 
Clearly, these assumptions are significant and may im- 
pede an immediate deployment of RBF in practice. And 
even with these assumptions, the resulting RBF design is 
far from trivial (for this reason, we in fact offload some 
of the details to an extended technical report [42]). How- 
ever, we hope through the design presented in this paper 
to start a focused discussion about how best to practi- 
cally introduce flexibility and security into the Internet 
and about what set of primitives routers must support to 
achieve this goal. In this paper we present one solution to 
this problem; in §10 we succinctly discuss the arguments 
that have led us to these specific assumptions and design. 


3 The RBF Data Plane 


In this section we describe the key components of the 
RBF data plane: rule syntax and how routers verify and 
execute rules. We then present examples of how rules 
are used. 


3.1 Rule Specification 


RBF represents a rule as a sequence of actions that can 
be conditioned by if-then-else instructions: 

    if (<CONDITION>) ACTION1
    else ACTION2

Conditions are comparison operators applied to packet 
and router attributes. An action can be one of: 


1. forward the packet to the underlying IP engine;
2. invoke a local function available at the router;
3. update the value of the packet attributes;
4. drop the packet.


Packet attributes include the standard IP header five-tuple 
(IP addresses, ports, protocol type) and, optionally, a 
number of custom attributes with user-defined semantics. 
For simplicity, RBF does not allow rules to dynamically 
add new attributes. Router attributes may include, for
example, the router’s IP address, AS number, link
congestion levels, and flags indicating whether the router
implements a specific function (e.g., a rule can check
router.local_cache to discover whether the router
maintains a local content cache). Rules are allowed to
update packet attributes, but not router attributes.

² Some may regard this model of security unsatisfactory; we
discuss alternatives to this deployment in §4.


Each rule has an associated lease that ensures the rule
can only be used for a limited period of time (§4.3).
Also, every rule has an identifier (ID) defined as the con- 
catenation of a hash of the rule owner’s public key and 
an index unique to the owner, hash(PK_owner):index. In 
Section 7 we present an optimization to reduce packet 
overhead and identify most rules by using a hash over 
their content. This optimization can be used in the 
common case when there is no need for multiple rules 
with the same identifier; for example, mobile hosts may 
require different rules with the same identifier (see §3.4). 
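
As a concrete illustration of the two identifier forms, the sketch below builds an owner-based ID of the form hash(PK_owner):index and a content-based ID. The choice of SHA-256, the hexadecimal encoding, and the function names `rule_id` and `content_rule_id` are assumptions made for the example; the text does not specify a hash function or wire format.

```python
import hashlib


def rule_id(owner_public_key: bytes, index: int) -> str:
    """Owner-based identifier: hash(PK_owner):index."""
    # SHA-256 is an assumption; the text only says "a hash".
    digest = hashlib.sha256(owner_public_key).hexdigest()
    return f"{digest}:{index}"


def content_rule_id(rule_text: str) -> str:
    """Content-based identifier: a hash over the rule itself."""
    return hashlib.sha256(rule_text.encode("utf-8")).hexdigest()
```

Two different rules issued by the same owner can share an owner-based identifier (same key, same index), which is what a mobile host needs; a content-based identifier changes whenever the rule text changes.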


The following is an example of a rule that forwards 
a packet to destination D via a waypoint router R1; 
a packet attribute visitedR1 indicates whether the 
packet has already visited R1: 


R_D:
  if (packet.visitedR1 == FALSE)      // from src. to R1
    if (router.address != R1)
      sendto R1
    else packet.visitedR1 = TRUE      // to D

  if (packet.visitedR1)
    sendto D


where sendto involves setting the IP destination ad- 
dress to D and then handing the packet to the underlying 
IP forwarding engine (assuming, of course, that D is 
not the local address). Rule execution terminates at a 
sendto or drop action; the packet is dropped if the 
rule does not arrive at an explicit sendto. Finally, 
rules can invoke local functions at the router; after the 
invocation the packet is returned to the forwarding layer. 
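
To make the execution semantics concrete, the following sketch evaluates the waypoint rule at a single router. It is a hand translation of the rule into Python for illustration only; the function name `run_rule_R_D`, the dictionary used for packet attributes, and the tuple return values are assumptions, not the prototype's interface.

```python
def run_rule_R_D(packet: dict, router_address: str,
                 R1: str = "R1", D: str = "D"):
    """Evaluate the waypoint rule once; return (action, next_destination)."""
    if not packet["visitedR1"]:            # still on the way from src. to R1
        if router_address != R1:
            return ("sendto", R1)          # execution terminates at sendto
        else:
            packet["visitedR1"] = True     # at R1: mark, then continue below
    if packet["visitedR1"]:
        return ("sendto", D)
    # Default from the text: no explicit sendto means the packet is dropped.
    # This line is not reachable for this particular rule.
    return ("drop", None)
```

A packet that starts with `visitedR1` set to false is steered to R1 by every router other than R1; at R1 the attribute is set and the packet is sent on to D, and every later router also forwards it toward D.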


3.2 Distributing Rules to Routers


To forward a packet, a router must first obtain its rule. 
There are two potential approaches: (1) rules are carried 
in packets, (2) routers use an out-of-band mechanism to 
obtain rules. In RBF, we choose to carry rules in packets 
since the second approach would require complex rule 
distribution and storage protocols, and would incur extra 
delays in communication setup (in fact this approach 
would likely require special “rule-less” traffic to install 
rules). The tradeoff is higher overhead on the data path 
as rules increase packet size and routers must verify each 
packet’s certificate; our evaluation in §7 suggests this 
overhead is acceptable given the capabilities of modern 
network equipment. 


A packet with source S and destination D must include 
a destination rule, R_D, which is the rule specified and
owned by D. In addition, a packet may include a return 
rule; this is the rule specified and owned by S and is 
used for return traffic from D to S. 
3.3 Rule Verification


As mentioned earlier, rules are certified by a Rule Cer- 
tification Entity (RCE) and all packets carry a signature 
that routers must verify. The verification load at routers 
is eased by two factors. First, only routers at trust bound- 
aries need to verify rules. Second, routers can cache ver- 
ification results by maintaining a hash of the rule and its 
signature. With caching, the full signature verification is 
only required for the first packet forwarded on a new rule 
(as long as the verification result is cached). Thus, verifi- 
cations can be limited only to border routers and, assum- 
ing a large enough cache, the verification rate is given by 
the arrival rate of packets with new rules. By contrast, 
the signature length adds to the overhead of every packet. 
Different cryptographic solutions offer different trade- 
offs between signature length, signing time (incurred 
only at RCEs), verification time (incurred at routers) 
and security. Our current RBF design assumes Elliptic
Curve Cryptography (ECC) because ECC signatures are 
shorter than RSA ones, while exhibiting similar security 
properties. At the same time, verification time in ECC 
is typically longer than RSA’s. However, in practice 
verification can be accelerated using ASIC-based im- 
plementations or dedicated specialized co-processors. 
Such implementations are already commercially avail- 
able [5, 7, 8] and incorporated into network appliances 
and routers. Furthermore, traffic measurements [4] show 
that new flow arrivals represent less than 1% of the 
link capacity on average and less than 5% of the total 
number of packets, a volume that can be accommodated 
using commercial ECC modules [5, 7] or recent research 
proposals [53, 34]. We evaluate different signature 
mechanisms briefly in § 7 and in greater detail in [42]. 
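
The caching of verification results described above can be sketched as follows. This is a simplified illustration under stated assumptions: the class name `VerificationCache` is hypothetical, the signature check is supplied by the caller (for example, an ECC verification routine), SHA-256 is an assumed choice for the cache key, and the cache is unbounded and ignores rule leases, both of which a real router would have to handle.

```python
import hashlib


class VerificationCache:
    """Remember verification results keyed by a hash of rule and signature."""

    def __init__(self, verify_signature):
        # verify_signature(rule, signature) -> bool is supplied by the caller.
        self._verify = verify_signature
        self._results = {}
        self.full_verifications = 0   # counts the expensive checks performed

    def is_valid(self, rule: bytes, signature: bytes) -> bool:
        # A length prefix keeps (rule, signature) pairs unambiguous.
        data = len(rule).to_bytes(4, "big") + rule + signature
        key = hashlib.sha256(data).digest()
        if key not in self._results:
            # First packet carrying this rule: run the full verification.
            self.full_verifications += 1
            self._results[key] = self._verify(rule, signature)
        return self._results[key]
```

With such a cache, the number of full signature verifications tracks the arrival rate of packets with new rules rather than the total packet rate.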


3.4 Examples of RBF usage 


To illustrate the application of rules, we present a series 
of example usage scenarios; the rule syntax in these 
examples is largely identical to the high-level rule 
language supported by our RBF prototype router (§6),
with simplifications for readability as appropriate. 


Port-based filtering: A web server, D, uses the following 
simple rule to ensure it only receives packets on port 80: 


R_filter_port:
  if (packet.dst_port != 80) drop;
  sendto D


Middlebox Support: In addition to accepting traffic di- 
rectly on port 80, D might use the following rule to 
route all other incoming traffic through a packet scrub- 
ber [2, 6]. This functionality can be deployed either by 
D’s provider (as a router function), or by a third party (at 
a middlebox Scrb) as presented below: 


R_mbox_port:
  if (packet.dst_port == 80)
    sendto D                          // directly to D
  else
    if (packet.scrubbed == FALSE)     // before scrubber
      if (router.address != Scrb)
        sendto Scrb
      else                            // at scrubber
        packet.scrubbed = TRUE        // mark scrubbed
        invoke Scrb_service           // scrub
    else
      sendto D                        // after scrubber

Note that Scrb can represent the address of a load
balancer used with several physical middleboxes.


Thus, similar to previous proposals [47, 49], RBF 
provides explicit support for middleboxes such as 
WAN optimizers, proxies, caches, encryption engines, 
transcoders, SSL offloaders, intrusion detection, etc. 
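
To make the control flow of the middlebox rule concrete, the sketch below re-expresses it as a Python function that is evaluated once per router and returns the action taken. It is only an illustration: the Packet class and the action strings are hypothetical, and we assume that the rule is evaluated again at the scrubber once the scrubbing service returns.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    dst_port: int
    scrubbed: bool = False   # packet attribute, carried in the header


def r_mbox_port(packet: Packet, router_address: str) -> str:
    """Evaluate the middlebox rule at one router and return the action taken."""
    if packet.dst_port == 80:
        return "sendto D"                  # web traffic goes directly to D
    if not packet.scrubbed:                # before the scrubber
        if router_address != "Scrb":
            return "sendto Scrb"
        packet.scrubbed = True             # at the scrubber: mark, then scrub
        return "invoke Scrb_service"
    return "sendto D"                      # after the scrubber


# Trace of a non-web packet: ordinary router, then twice at the scrubber.
p = Packet(dst_port=22)
trace = [r_mbox_port(p, "R1"), r_mbox_port(p, "Scrb"), r_mbox_port(p, "Scrb")]
```

A packet created with scrubbed already set to TRUE is sent to D by the first router it meets, without ever reaching Scrb.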


Secure Middlebox Traversal: In the previous example, 
an attacker can directly send a packet with the attribute 
values set so as to appear that the packet has already vis- 
ited the middlebox. More generally, one should be able 
to enforce that rule directives are respected when the rule 
participants (sources, middleboxes) are not trusted. 

One approach to protect against this behavior is to 
leverage RBF’s assumption that sources cannot spoof 
their addresses. More specifically, after each middlebox 
the rule can verify that the packet has indeed been sent by 
the required middlebox, since middleboxes/waypoints 
need to set the (non-spoofable) source address attribute 
in packets (for brevity we omit this in the presented 
examples); see [42] for more details on this approach. 

In an alternate approach, special cryptographic func- 
tions deployed at middleboxes and destinations can be 
used to create/verify proofs guaranteeing the packet has 
visited the middlebox, as follows: 


R_mbox_port_crypto:
  if (packet.dst_port == 80)
    sendto D                            // directly to D
  else
    if (packet.proven == FALSE)
      if (router.address != Scrb)       // before scrubber
        sendto Scrb
      else                              // at scrubber
        if (packet.scrubbed == FALSE)
          packet.scrubbed = TRUE
          invoke Scrb_service           // (1) scrub
        else                            // scrubbed
          packet.proven = TRUE
          invoke Prove                  // (2) create proof
    else                                // proven
      if (router.address != D)
        sendto D
      else
        invoke VerifyAndDeliver         // check proof at D

In this example, the Prove function at the middlebox 
signs the immutable part of the packet header and/or 
payload, and adds this signature as an attribute to the 
packet header. In turn, the VerifyAndDeliver 
function at D checks the middlebox signature and, if the 
check succeeds, delivers the packet to the end applica- 
tion. Note that checking the signature requires that D 
knows the public or shared key(s) of middlebox(es); for 
efficiency, the middlebox could sign the hash chain of a
batch of packets.
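
The shared-key variant of this exchange can be sketched with a keyed MAC. The fragment below is an assumption-laden illustration rather than the prototype's code: it uses HMAC-SHA256 from the Python standard library, treats the "immutable part" as an opaque byte string, and the function names prove and verify_and_deliver merely mirror the Prove and VerifyAndDeliver functions named in the rule.

```python
import hashlib
import hmac


def prove(immutable_part: bytes, shared_key: bytes) -> bytes:
    """Middlebox side: compute a proof over the immutable part of the packet."""
    return hmac.new(shared_key, immutable_part, hashlib.sha256).digest()


def verify_and_deliver(immutable_part: bytes, proof: bytes, shared_key: bytes) -> bool:
    """Destination side: deliver (return True) only if the proof checks out."""
    expected = hmac.new(shared_key, immutable_part, hashlib.sha256).digest()
    return hmac.compare_digest(expected, proof)   # constant-time comparison


key = b"key shared by Scrb and D (hypothetical)"
header_and_payload = b"src=S dst=D port=22 | payload bytes"
proof = prove(header_and_payload, key)

delivered = verify_and_deliver(header_and_payload, proof, key)
tampered = verify_and_deliver(header_and_payload + b"!", proof, key)
```

A public-key signature would replace the MAC when D should not be able to forge proofs itself; that variant needs a cryptographic library outside the standard library and is not shown.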


DoS Protection: To protect against DDoS attacks, a
server D can create a custom rule for each client that 
drops packets from any source other than the client. By 
controlling the number of rules active at a given time, 
D controls the maximum number of active clients (each 
rule has an associated lease period). An example of a rule 
similar to a network capability [52, 51] is: 

R_filter_src:
  if (packet.source != requester_IP)
    drop;
  ... // rest of the rule

Similarly to capability based architectures [52, 51], 
our solution is based on the premise that destinations 
are able to grant rules on demand, and that any requester 
can ask for a destination’s rule. In RBF, this task falls 
to the rule resolution infrastructure and raises the pos- 
sibility of a “denial of rule” attack on this infrastructure 
(akin to denial-of-capability attacks in capability-based 
systems [41]). We present the details of rule resolution
and discuss denial-of-rule attacks in §4. 
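
The way leases bound the number of active clients can be sketched as follows. This is a hypothetical helper written for illustration only: the RuleGranter class, its parameters and the textual rule format are assumptions, and certification of the granted rule by an RCE is omitted.

```python
class RuleGranter:
    """Hypothetical helper: issues at most max_active per-client rules at a time."""

    def __init__(self, max_active: int, lease_seconds: float):
        self.max_active = max_active
        self.lease_seconds = lease_seconds
        self._expiry = {}   # requester IP -> lease expiration time

    def grant(self, requester_ip: str, now: float):
        # Forget rules whose lease has run out; they no longer admit traffic.
        self._expiry = {ip: t for ip, t in self._expiry.items() if t > now}
        if requester_ip not in self._expiry and len(self._expiry) >= self.max_active:
            return None     # too many active clients: refuse to issue a rule
        expires = now + self.lease_seconds
        self._expiry[requester_ip] = expires
        # Only the source filter is spelled out, as in the example rule.
        rule = f"if (packet.source != {requester_ip}) drop; ..."
        return rule, expires


granter = RuleGranter(max_active=2, lease_seconds=60)
first = granter.grant("192.0.2.1", now=0)
second = granter.grant("192.0.2.2", now=0)
third = granter.grant("192.0.2.3", now=10)   # refused: two leases still active
later = granter.grant("192.0.2.3", now=61)   # accepted: earlier leases expired
```

Shorter leases let D reclaim capacity from idle clients sooner, at the cost of more frequent rule requests.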


Mobility: Host D changes its network IP address due 
to physical movement. In RBF, D can continue an exist- 
ing communication without having to re-establish it. To 
achieve this, D creates a rule for the new address with the 
same ID as the rule used in the existing communication, 
and places it in the packet as the return rule. 


Multicast: For security reasons, RBF does not support 
packet replication, and thus multicast cannot be imple- 
mented entirely at the RBF layer. Instead, multicast can 
be implemented by invoking multicast functionality de- 
ployed by ISPs at a subset of their routers; this function- 
ality maintains (soft) state at routers to create a (reverse 
path) multicast tree. This approach implements essen- 
tially an overlay multicast solution, which leverages the 
IP multicast functionality at on-path routers (see [42] for 
details). 


On-path Caching: Consider an ISP I that deploys 
caching functionality at some of its (border) routers. A 
web-service D can contract with I and use this function- 
ality. For this purpose, D creates and publishes the fol- 
lowing rule: 


R_caching:
  if (router.caching_available and
      packet.crt_router != router.address)
    packet.crt_router = router.address
    invoke Caching
  sendto D

where the crt_router attribute makes sure the
caching functionality is called just once at each
caching-enhanced router.


In this example, the caching functionality can decide 
to respond to the requester directly and not forward 
the packets further to D, which reduces latency for the 
requester and traffic load at D. A similar rule can support 
recent proposals for content-centric routing [35, 33]. 


Other Examples: Our technical report [42] provides
examples of applying RBF to a range of additional
applications, including: secure loose path forwarding [44,
40], multipath forwarding, network diagnostics, anycast, 
reverse traceroute (path recording), delay-tolerant net- 
working and even source control over middlebox or path 
selection. Importantly, these individual examples can be 
combined as needed. For example, a content distribution 
network can distribute load among multiple sites using 
anycast and, at the same time, protect its servers with on- 
path IDS functionality provided by ISPs. 


4 The RBF Control Plane 


In RBF, ISPs provide their clients with rules to access 
the local DNS server and a Rule Certification Entity 
(RCE), which can certify clients’ rules. This information 
can be provided through a modified DHCP service, 
similar to the way ISPs or organizations provide the IP 
address of DNS servers today. 

In this section, we describe the RBF mechanisms for 
rule creation and certification (§4.1), rule distribution 
(§4.2), lease enforcement (§4.3) and anti-spoofing (§4.4).


4.1 Rule Creation and Certification 


To receive traffic, a client must create a rule that allows 
one or more sources to send traffic to it. Before distribut- 
ing this rule, the client must ask an RCE to certify it. 
RCE certification guarantees that rules obey the policies 
of all stakeholders. In particular, certification guarantees 
the following properties: 


1. Every destination in the rule (i.e., any address that 
appears as an argument of a sendto instruction) has 
agreed to receive packets using that rule; 


2. The operators providing router functions invoked by 
the rule approve the rule behavior; 


3. The rule cannot cause infinite loops; 


4. The rule cannot bypass ISP routing policies. 


A client can either create rules itself and directly ask 
an RCE to certify these rules, or use a trusted DHCP-like 
service to create and certify rules on its behalf. In the 
remainder of this section we present the former case. 

As described above, the ISP provides each client with 
a rule to access an RCE that has a contract with the ISP. 
The following example shows a possible rule that allows 
a client D to access an RCE named C: 


R_{D→C}: if (source == D) sendto C


Before certifying a rule, an RCE verifies that the rule 
has been authorized by each destination that appears in 
the rule. A client who has created a rule authorizes it by 
simply signing the rule with its private key. A client that 
appears in the rule as a destination, other than the rule’s 
creator, will first verify that the content of the rule obeys 
its policies before signing the rule. For example, an
intrusion detection box may verify that the destination
indeed belongs to a client allowed to use the service
(e.g., based on a contract between the client and the
provider of the intrusion detection service), a waypoint
router may verify that the final destination is allowed to
use source-routing, etc.

Figure 2: Rule Certification

Let (K_D, K_D^{-1}) denote the (public, private) key pair
of client D, and let IP_D be the IP address of D. To prove
to an RCE that the client signing the rule with private
key K_D^{-1} indeed owns IP address IP_D, client D sends
a certificate along with the signed rule that binds its
public key K_D and IP address IP_D. This certificate is
signed by an entity T, i.e., [IP_D, K_D]_{K_T^{-1}}, where K_T^{-1}
represents the private key of T. Clearly, the RCE must
trust entity T. In fact, in our solution we will assume that
T is itself an RCE.

Next, we present the rule certification process in detail, 
initially for the case in which the rule has a single 
destination, and then for the case in which the rule has 
multiple destinations or waypoints/middleboxes. 


Certify single-destination rules: Assume destination D
wishes to certify a rule R that forwards packets only to
its address IP_D, e.g., R: sendto IP_D. Also, assume D
already has a rule R_D on which it can be reached by the
RCE C. D obtains this rule as part of the bootstrapping
process, which we discuss later.

Fig. 2 shows the certification of D’s rule, R, by C: 


1. Host D signs rule R with its private key, and sends
it to C using rule R_{D→C}. In addition, D sends
the certificate binding its public key and address,
i.e., [IP_D, K_D]_{K_T^{-1}}. Upon receiving this request, C
verifies the certificate as well as the signature of the
requested rule. These ensure that the request has been
made by the owner of K_D and that the requester is
also the owner of IP_D. In addition, C verifies that R
is well formed (see §5).


2. If rule verification succeeds, C signs the rule with its 
private key and sends it back to D using the return 
rule in its certification request, R_D. At this point, host
D can distribute rule R to other hosts directly (as a 
return rule) or through DNS. 


The certification procedure (Fig. 2) needs only to 
guarantee the authenticity of the request. Since rules are 
public, confidentiality is not a concern. Since the lease is
an absolute value (§4.3), the only effect of replaying rule
requests is increased traffic at the RCE. The maximum 
lease value that C can sign for a rule is negotiated 
between D’s ISP and C. Furthermore, RCEs can limit the 
number of clients contacting them and can limit each 
user’s certification rate, as we discuss in this section. 


Certify multiple destination rules: In this case, every 
destination (i.e., any host, middlebox, or waypoint router 
that appears as an argument of a sendto instruction) in 
a rule must agree to receive packets on that rule, i.e., the 
rule must respect its policies. In particular, every such 
destination must sign the rule. One of the destinations, D, 
collects the signatures of all the other destinations along 
with their certificates binding their public keys to their 
addresses. D then sends this information to its RCE. In 
turn, the RCE verifies that all destinations in the rule 
have signed the rule and sends the signed rule back to 
D. The lease signed by the RCE has the minimum du- 
ration between the requested lease and the leases of all 
the certificates binding the addresses and the keys of the 
participants. 
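
The lease computation for multi-destination rules amounts to taking a minimum. A minimal sketch, assuming that leases are represented as absolute expiration times in seconds (the function and parameter names are ours):

```python
def certified_lease(requested_expiry: int, certificate_expiries: list[int]) -> int:
    """Expiration time the RCE signs: the earliest of the requested lease and
    the leases of all participants' address-to-key certificates."""
    return min([requested_expiry, *certificate_expiries])


# One participant's certificate expires before the requested lease does.
lease = certified_lease(1000, [1500, 800, 1200])
```

In this example the signed lease is 800, so the rule never outlives any of the certificates that justified it.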


Certify rules invoking functions: Operators providing 
router functions can restrict which rules can invoke these 
functions. The certification process is similar to certify- 
ing multiple destination rules. The identifiers of func- 
tions whose invocation requires authorization are repre- 
sented as hashes of public keys. RCEs certify a rule con- 
taining such an invocation only if the rule is signed with 
the private key corresponding to the function identifier. 


Bootstrapping: To certify rules, client D needs to (1) 
know the rule to contact an RCE, C; (2) provide C with a 
return rule to receive the certified rule; and (3) obtain the 
certificate from a trusted authority that signs the binding 
between D’s key K_D and its address IP_D. We assume the
ISP provides D with a rule to access an RCE C (similarly 
to how ISPs today bootstrap clients’ access to the DNS). 
Given this initial rule, we use a simple request-response 
exchange between the client and the RCE to obtain both 
the certificate binding the client’s IP address to its key as 
well as its first rule. Due to space constraints, we refer 
the reader to our extended technical report [42] for more 
details on the bootstrapping process. 


RCE load and availability: To control its certification 
load, an RCE can rate-limit the number of certification 
requests that it processes from each individual client. 
Clients are identified by IP address; the anti-spoofing 
mechanism prevents clients from impersonating each 
other. Alternatively, clients can be identified by “person- 
alized” rules provided by the ISP to the customer to ac- 
cess the RCE; such rules may have a finer granularity 
than the anti-spoofing mechanism. RCEs can indirectly 
protect themselves against link-level DoS attacks by con- 
trolling the number of clients under contract. 
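
The text does not prescribe a particular rate-limiting algorithm. As one possible realization, the sketch below applies a token bucket per client IP address; the class name, the rate and burst parameters, and the explicit now argument are assumptions made for this example.

```python
class PerClientRateLimiter:
    """Token bucket per client IP: sustained `rate` requests/s, bursts up to `burst`."""

    def __init__(self, rate_per_second: float, burst: int):
        self.rate = rate_per_second
        self.burst = burst
        self._buckets = {}   # client IP -> (tokens, time of last update)

    def allow(self, client_ip: str, now: float) -> bool:
        tokens, last = self._buckets.get(client_ip, (float(self.burst), now))
        # Refill in proportion to the elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[client_ip] = (tokens - 1.0, now)
            return True      # process this certification request
        self._buckets[client_ip] = (tokens, now)
        return False         # over the limit: reject or defer the request


limiter = PerClientRateLimiter(rate_per_second=1.0, burst=2)
decisions = [limiter.allow("192.0.2.1", now=0.0) for _ in range(3)]
other_client = limiter.allow("192.0.2.2", now=0.0)
after_wait = limiter.allow("192.0.2.1", now=1.0)
```

Because buckets are keyed by source address, the scheme is only meaningful when addresses cannot be spoofed, which is the role of the anti-spoofing mechanism.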

Figure 3: Rule Distribution. (a) Regular Distribution; (b) DDoS protection
(solid lines = rule lookup process; dashed lines = data communication;
dotted lines = setup)


RCEs must be highly available to enable rule certifi- 
cation at any time. RCEs can meet this requirement by 
using multiple servers and multiple sites. ISPs and desti- 
nations can protect themselves against RCE unavailabil- 
ity by contracting with multiple RCEs. 


RCE Key Distribution and Revocation: In this paper 
we do not explore solutions for the distribution and revo- 
cation of RCE keys to routers. Here, we simply mention 
two possible approaches towards this goal. In one ap- 
proach, RCE keys could be distributed and revoked using 
DNSSEC. For example, in a TXT or other RR type, one
DNS entry contains the number of RCEs and, for each 
RCE, there is one DNS entry (based on its index such as 
“ID24.rce”) that contains the RCE’s key. Routers period- 
ically update the RCE keys. In another approach, RCEs 
could be deployed along AS boundaries, such that each 
AS would have its own RCE. This approach has the ad- 
vantage that additional security can be enforced, e.g., the 
trust in some RCEs can be restricted to their own address 
ranges. Secure BGP could be used to distribute RCE keys 
in this case, but at the expense of extra complexity.* 


4.2 Rule Distribution 


RBF uses an extended DNS infrastructure to distribute 
rules, as illustrated in Fig. 3(a). The destination D creates
and certifies a rule for itself (step A) and inserts it into 
the DNS (step B). A sender S that wants to contact D 
looks up D’s name in the DNS; the DNS is extended 
to return D’s rule rather than its address (step 1). After 
obtaining a rule to D, S directly sends packets to D (step 
2). Note that for practical purposes the rules of the DNS 
root servers need to have long leases (to avoid tedious 
reconfiguration or refresh protocols), as with today’s 
long-lived addresses. 

In Section 3.4 we pointed out that rules can be used 
to block DDoS attacks. This relies on (1) the ability to 
distribute customized rules to different senders (i.e., give 
a sender S a rule that drops all packets not generated by
S) and on (2) the ability to protect the rule distribution 
itself from DoS attacks. 

* Note that DNSSEC could also potentially be used to
distribute keys when RCEs are deployed along AS boundaries.

To protect against DDoS attacks, client D can contract
with a large entity E, and redirect its DNS entry to E,
by registering E’s rule under its DNS name. Fig. 3(b)
illustrates this approach. DNS will reply to a lookup
for D’s name with E’s rule (step 1). The DNS entry that 
contains E’s rule must belong to a new type of DNS RR. 
This new class of entries is returned directly to clients by 
DNS resolvers. Upon receipt of such an answer to its
DNS query, the requester will continue the DNS lookup 
by contacting E (step 2). E rate-limits rule requests and 
forwards them to D (step 3), thus protecting D from DoS 
attacks. For the authorized requesters, D creates rules 
(step 4) and replies back to the requesters (step 5). E 
forwards requests to D conforming to a policy (see §3.4), 
which can be updated by D at any time. 

Note that some malicious users may still get their re- 
quests forwarded by E and authorized by D. To alleviate 
this attack, E can employ fair queuing across senders, 
and D can blacklist known attackers at E. Such an ap- 
proach offers a protection similar to network capabilities 
that apply per-source fair queuing at routers [37]. 


4.3 Rule Leases 


The lease is an expiration time stamp certified along 
with the rule description. A router drops a packet if 
its current time exceeds the rule expiration time. For 
simplicity, in this paper we assume that all routers and 
RCEs are synchronized via NTP [14] as recommended 
by router manufacturers [19]. We present a solution that 
does not rely on global clock synchronization in [42]. 


4.4 Anti-Spoofing Mechanism 


If a source can spoof addresses on packets it sends, it can 
send packets to a destination D even if the rule does not 
allow it to, and in this way evade D’s policy. Moreover, 
one can mount a DDoS attack by using a single rule 
distributed by a malicious source to a set of colluders. To 
address this problem, RBF can use a previously proposed 
anti-spoofing mechanism. In this paper, we propose the 
use of ingress filtering, which is already deployed by 
over 75% of today’s ASes [23]. When deploying RBF, 
RBF routers could also be used to apply ingress filtering. 
Note that if malicious ASes do not apply ingress filter- 
ing, DoS protection is not fully compromised as only 
hosts in these ASes can launch attacks. 
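
At its simplest, ingress filtering is a prefix-membership test on the source address of packets entering from a customer. The following sketch uses Python's standard ipaddress module; the function name and the example prefixes (taken from the documentation address ranges) are assumptions for illustration.

```python
import ipaddress


def passes_ingress_filter(source_ip: str, customer_prefixes: list[str]) -> bool:
    """Accept a packet only if its source address lies within a prefix
    assigned to the customer behind the ingress interface."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(prefix) for prefix in customer_prefixes)


prefixes = ["192.0.2.0/24", "2001:db8::/32"]
legitimate = passes_ingress_filter("192.0.2.17", prefixes)
spoofed = passes_ingress_filter("198.51.100.5", prefixes)
```

A packet that fails this test is dropped at the edge, so rules that filter on packet.source can rely on that attribute.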

Instead of ingress filtering, RBF could leverage other 
anti-spoofing mechanisms such as Passport [36]. How- 
ever, Passport [36] requires a secure routing layer and 
incurs extra overhead in packets. 

The anti-spoofing mechanism requires middleboxes 
and routers that change a packet’s destination address 
also to change the packet’s source address attribute. 


5 Security Analysis 


The RBF design aims to achieve the following three
goals: (i) policy enforcement: ensure that the authorized
rules respect the policies of all participants (routers,
middleboxes, destinations), and packets with unauthorized
rules are dropped inside the network; (ii) rule enforcement:
rules cannot be used by malicious senders and, if
senders or rule participants are untrusted, compliance with
rule directives can be enforced; and (iii) rule safety: rules
cannot be used to attack the network. Next, we summarize
RBF’s security properties, the threat model and
assumptions under which they hold, and the mechanisms 
that allow RBF to meet these goals. We present a detailed 
analysis and proofs of RBF’s security properties in [42]. 


Assumptions: We assume that DNS resolution is secure, 
that distribution of RCE keys to routers is secure, and that 
RCEs are not malicious. 


Attackers: An attacker in RBF can be any host, middle- 
box, or router: sources can attempt to attack destinations 
by forging, evading or tampering with their rules; des- 
tinations can try to attack the network by creating rules 
that waste resources and slow down routers; middleboxes 
and routers can attempt both of these attacks. 


Security Properties: We decompose the aforementioned 
security goals into four specific desired properties: 


1. No Rule Forging: A host S cannot manufacture a 
rule that sends packets to another host D, unless D 
explicitly agrees with this rule, i.e., destinations and 
middleboxes control the creation of rules that send 
traffic to them. 


2. No Rule Tampering: Sources, routers and middle- 
boxes cannot tamper with the destination’s rules. 


3. No Rule Evasion: Host S cannot send packets to des- 
tination D, if D’s rules do not accept packets from S. 


4. Network Safety: A destination D cannot create 
unsafe rules. In particular, D cannot create rules that 
(a) cause infinite loops, (b) corrupt router state, (c) 
DoS routers or RCEs, or (d) violate ISP policies. 


Mechanisms and Defenses: RBF uses four mechanisms 
to achieve the above properties: (1) rule certification, (2) 
rule leases, (3) anti-spoofing, and (4) static analysis.
Table 1 summarizes which mechanisms serve to meet the
four security properties. 


6 Implementation 


This section describes our prototype RBF router and 
rule compiler. 


6.1 An RBF Rule Compiler 


Our prototype offers users a high-level language largely 
identical to the syntax used in this paper in which to write 
rules. We wrote an RBF compiler in C++ that translates 
this high-level language into a compact rule format 
carried in packets. This compact format uses: 8B(ytes) 
for public-key hashes, 3B for the user-local index, 3B to 
identify the RCE, 3B to identify router-defined functions 
that do not require approval to be invoked and 8B for 
those that do, and 2B as the default RBF packet attribute 
values.° For the lease we use an absolute expiration time 
consisting of the first 4B of the NTP format, with
second-level granularity and a wrap-around period of 136
years. For efficiency, we use variable-length encoding in 
representing the internal rule structure. The maximum 
rule description size is 256B in our implementation. 
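
The lease field can be illustrated with a short sketch. It relies on two facts about the NTP timestamp format: its first 32 bits count seconds since 1 January 1900, and that epoch precedes the Unix epoch by 2,208,988,800 seconds. The function names are ours, the byte order is assumed to be network order, and for simplicity the decoder ignores wrap-around into later NTP eras.

```python
import struct

NTP_UNIX_OFFSET = 2_208_988_800   # seconds from 1900-01-01 to 1970-01-01


def encode_lease(unix_expiry: int) -> bytes:
    """Pack an expiration time into the 4-byte seconds field of an NTP timestamp."""
    return struct.pack("!I", (unix_expiry + NTP_UNIX_OFFSET) % 2**32)


def decode_lease(field: bytes) -> int:
    """Recover the Unix expiration time (valid within the first NTP era only)."""
    (ntp_seconds,) = struct.unpack("!I", field)
    return ntp_seconds - NTP_UNIX_OFFSET


def is_expired(field: bytes, unix_now: int) -> bool:
    """A router drops the packet when its current time exceeds the lease."""
    return unix_now > decode_lease(field)


expiry = 1_300_000_000
field = encode_lease(expiry)

# A 32-bit seconds counter wraps after 2**32 seconds, i.e. roughly 136 years.
wrap_years = 2**32 / (365.25 * 24 * 3600)
```

The wrap-around figure quoted in the text follows directly from the 32-bit width of the field at one-second granularity.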


6.2 A Prototype RBF Router 


Rationale: We implemented RBF forwarding using 
Click [39] and RouteBricks [26]. Most commercial 
routers implement packet processing using ASICs 
or specialized network processors (NPs) rather than 
general-purpose CPUs and, as such, our software-based 
prototype is not entirely representative of currently de- 
ployed routers. To a large extent, our choice of proto- 
typing platform is borne of necessity since commercial 
routers are closed. Beyond necessity, however, we be- 
lieve a software-based prototype is valuable for mul- 
tiple reasons. First, recent research [26, 31, 27] has 
demonstrated that, with modern multi-core servers, it is 
now possible to build high-speed software routers up to 
edge and even core speeds. Secondly, while not directly 
reusable, several aspects of our implementation archi- 
tecture such as our approach to partitioning tasks across 
multiple cores should apply to network processor-based 
routers. Finally, several research [12, 28] and commercial 
switches [3] augment ASIC-based switches with some 
number of co-located general-purpose cores or servers 
for greater flexibility in packet processing — our proto- 
type architecture is directly applicable to such platforms. 


Design requirements: We build our prototype in the 
context of modern multi-core servers that incorporate 
multiple processors or “sockets”, each with multiple 
cores [17, 1]. As shown in Fig. 1, the software stack of an 
RBF router includes the following key components: (1) 
an IP forwarding module, (2) the rule execution engine, 
and (3) some (possibly zero) number of specialized for- 
warding function modules. All packets traverse the rule 
execution and IP forwarding components, while different 
subsets of packets may traverse one or more specialized 
functions. In addition, the resources required to process 
a packet may vary widely across functions; e.g., an encryption
function would use lots of CPU but little cache,
while a caching module may use more cache and less
CPU. At a high level, our design goal is to balance high
performance (i.e., making efficient use of resources) with
performance isolation, both across different functions,
and between functions and the rule execution engine
(i.e., sharing resources in a fair manner).

Figure 4: Core Allocation Example in RBF Router

° Our current prototype only supports this default size.


Approach: In its full generality, the above goal re- 
quires contention-aware scheduling that simultaneously 
takes into account the multiple resources (cores, various 
caches, memory bandwidth, I/O bandwidth) for which 
tasks might contend. For modern multi-core systems, this 
is in itself an area of active research [24, 54] and be- 
yond the scope of this paper. Instead, in our prototype, 
we address the issue as follows. The IP forwarding mod- 
ule and the rule execution engine are the central, most 
critical, components of the router and hence we assign 
these to a socket of their own and do not run special- 
ized functions at cores in this socket. This avoids having 
the IP and rule execution engines contend with special- 
ized functions for cache, CPU and other resources at the 
cost of some potential inefficiency since these “reserved” 
cores (if unused) cannot be used by specialized functions 
(if needed). We then assign specialized functions to the 
remaining “unreserved” cores. We rely on the existing 
(Click and Linux in our implementation) system sched- 
ulers to ensure fair sharing of CPU resources between 
functions on the same core. 

To achieve high performance, we run a single thread 
performing both IP forwarding and rule execution at 
each of the reserved cores; this ensures that packets that 
do not invoke any specialized functions are processed 
entirely by a single core avoiding potentially expensive 
cache misses and inter-core synchronization [26]. Pack- 
ets that invoke specialized functions must be relayed 
across cores and hence incur corresponding performance 
overheads due to cache misses and so forth. To improve 
the efficiency of such transfers when these functions 
are implemented in user space, we use shared memory 
pages and event queues. In our current prototype, when 
a rule invokes a user-level function, we make a single 
copy of the packet to the shared memory. An example of 
the resulting system architecture is depicted in Fig. 4. 

Figure 5: Rule Sizes


7 Evaluation 


We use our prototype to evaluate the overhead RBF im- 
poses on packets (§7.1), routers (§7.2) and RCEs (§7.4).


7.1 Packet Size Overhead 


Fig. 5 presents rule sizes (in bytes) for a range of exam- 
ples, including those from §3.4. The figure captures all 
the RBF-related fields and presents the size broken down 
into (a) the rule and the associated attributes’ binary 
encoding; (b) the control fields used for the lease, RCE 
identification, to specify whether the return rule is in the 
packet and so forth; and (c) the rule signature. We assume 
a 41B signature obtained using ECDSA with ECC public 
keys for RCEs derived from the NIST B-163 or K-163 
curves [18], offering 80 bits of security. Note that RBF is 
independent of the exact signature scheme used and that 
smaller (and faster) signatures can be used. However, 
shorter RCE keys may require more frequent updates to 
compensate for the lower security guarantees. The rules 
in Fig. 5 do not contain an identifier, and are identified 
by endpoints and routers using a hash over their content. 
Rule identifiers are required for rules whose content 
may change during a communication (such as the rules 
of mobile hosts) and incur an additional 11B overhead
in our implementation (8B for the hash of the public key 
and 3B for the user-selected index). Note that the rule 
identifier need be unique only with respect to a single 
communication endpoint (i.e., all parties that a host X 
communicates with should have unique rule identifiers). 

From Fig. 5 we can see that many common forwarding 
scenarios (unicast, routing via middleboxes, rules for 
DoS protection) can be expressed with around 60-80B 
rules while more complex rules (e.g., loose source rout- 
ing, secure middleboxes, anycast) can take as much as 
140B. The average rule size across all examples we have 
implemented is 85B, representing 13% overhead for an 
average packet of 630B [4] and 6% overhead for a 1500B
packet. By comparison, using RSA-1024 signatures
(instead of ECDSA) would incur 27% overhead on a 
630B packet and 11% overhead on a 1500B packet. 
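
The percentages above can be reproduced with simple arithmetic. The sketch below assumes that the RSA variant differs from the ECDSA one only in signature size, with a 1024-bit RSA signature occupying 128 bytes; this assumption is ours, although it yields the figures quoted in the text.

```python
def overhead_percent(rule_bytes: int, packet_bytes: int) -> float:
    """RBF bytes as a percentage of the packet size."""
    return 100.0 * rule_bytes / packet_bytes


AVG_RULE = 85       # average rule size in bytes (from the text)
ECDSA_SIG = 41      # ECDSA signature size in bytes (from the text)
RSA1024_SIG = 128   # a 1024-bit RSA signature occupies 128 bytes

# Assumption: swapping the signature scheme changes only the signature size.
avg_rule_rsa = AVG_RULE - ECDSA_SIG + RSA1024_SIG   # 172 bytes

for label, size in (("ECDSA", AVG_RULE), ("RSA-1024", avg_rule_rsa)):
    for packet in (630, 1500):
        print(f"{label}: {size}B rule on a {packet}B packet -> "
              f"{overhead_percent(size, packet):.0f}%")
```

Under this assumption the average rule grows from 85B to 172B, roughly doubling the relative overhead at both packet sizes.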


Potential Optimization - Rule Caching: Per-packet 
overhead can be significantly reduced by caching rules 
at endpoints and routers; packets whose rules have been 
cached need only carry rule identifiers. There are two op- 
portunities for caching. First, destinations can cache re- 
turn rules; this allows the return rule to be eliminated 
from all but the first packet in a source-to-destination 
exchange. Second, rules can also be cached at routers. 
Here, however, we must ensure no packet carrying only 
a rule identifier arrives at a router that does not store the 
corresponding rule description. This might occur, for ex- 
ample, due to a route change or when a router deletes 
the rule from its cache. In such cases, the router can sim- 
ply drop the packet in question, if the endpoints include 
the rule on all retransmissions and during periods of high 
packet loss. Of course, caching imposes additional stor- 
age overhead at routers as we evaluate shortly. 
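
The router-side behaviour described above can be sketched as a small lookup table. This is an illustration under simplifying assumptions: the class and method names are hypothetical, the identifier is taken to be a truncated hash of the rule content, and signature verification, lease checks and cache eviction are omitted.

```python
import hashlib
from typing import Optional


class RouterRuleCache:
    """Maps rule identifiers to rule descriptions seen in earlier packets."""

    def __init__(self):
        self._rules = {}   # rule identifier -> rule description

    def process(self, rule_id: bytes, rule: Optional[bytes]) -> str:
        if rule is not None:
            self._rules[rule_id] = rule   # packet carries the full rule: cache it
            return "forward"
        if rule_id in self._rules:
            return "forward"              # identifier-only packet, rule is cached
        return "drop"                     # unknown identifier: sender must resend the rule


rule = b"if (packet.dst_port != 80) drop; sendto D"
rule_id = hashlib.sha256(rule).digest()[:8]   # hypothetical identifier choice

router = RouterRuleCache()
before = router.process(rule_id, None)    # e.g. after a route change
learned = router.process(rule_id, rule)   # retransmission includes the rule
after = router.process(rule_id, None)
```

Dropping on a miss is safe only because endpoints include the full rule on retransmissions, as the text requires.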

In summary, based on our evaluation, we see that the 
per-packet overhead due to RBF can range from as low 
as 24B when using caching and up to ~250B in the bad 
case where there is no caching and the packet carries 
complex destination and return rules. 


7.2 Router Overhead 


In this section, we evaluate the overhead RBF imposes 
on routers for rules that do not invoke specialized 
processing functions; we consider router functions in the 
following section. The primary overhead RBF imposes 
on routers is the additional processing required to 
execute and authenticate rules and the additional storage 
capacity required if rules are cached. In this paper we 
do not evaluate rule authentication, which we assume 
is done by specialized hardware at trust-boundary 
routers; in [42] we present an evaluation for software 
rule authentication using RSA signatures, and show that 
our software router is not significantly slowed down 
when forwarding realistic traffic traces and performing 
verifications (the slowdown is less than 10%). 


Rule Forwarding: We first measure the overhead of rule 
processing by comparing the performance of RBF-on- 
RouteBricks to that of unmodified RouteBricks running 
on a single high-end server machine. We use a dual- 
socket server with four 2.8GHz Intel Xeon (X5560) cores 
per socket to (from) which we generate (sink) traffic over 
two dual-port 10G NICs. In this experiment, we use all 8 
cores to forward packets. 

Fig. 6 plots forwarding rates for some of the examples 
from Fig. 5. The first column represents a packet stream 
with sizes generated based on a packet trace collected on 
the Abilene backbone [11]; since the packets from the 
trace do not have rules, we add to each packet the slowest 
Figure 6: Forwarding speed for RBF over RouteBricks 


rule that fits in the packet. By “slowest” we mean the 
rule that takes the longest time to forward, as determined 
by the number of conditions and actions encountered 
during forwarding. To capture the performance impact 
for small packets, we profile each rule without any 
payload and with no return rules. In the figure, packet 
sizes are shown next to the example name and entries 
are sorted in order of increasing packet size; the packet 
size also includes the Ethernet and IP headers. The last 
columns depict forwarding of larger packets, i.e., that 
also contain data payload. To see the impact of the 
type of rule for these packets, we profiled them with 
the fastest and the slowest rules. Note that all rules are 
profiled in the worst case, meaning that the longest path 
through the rule is considered. For the slowest rule we 
use a 145B anycast rule which selects one out of 10 
destinations based on the value of a packet attribute. 

Overall, we see in Fig. 6 that the performance degrada- 
tion due to RBF’s more complex per-packet processing 
is always modest (<15%) and virtually non-existent at 
larger packet sizes. For small packets the CPU is the for- 
warding bottleneck, and RBF’s added processing slows 
the router. For larger packets the I/O system is the bot- 
tleneck, and there are enough free CPU cycles to execute 
rules. A fine-grained profile of the rule execution module 
showed that it uses between 120 CPU cycles per packet 
for the fastest rule and 600 CPU cycles for the slowest 
rule; in comparison, the IP router used in our experi- 
ments requires around 3000 cycles per packet without 
rule execution. Also note that compared to the network- 
level forwarding results from Fig. 6, application-level 
goodput is further reduced by the RBF header. 


Router cache sizes: We earlier proposed that routers 
cache rule authentications and/or rule descriptions. In 
each case, the number of cache entries required depends 
on the number of distinct rules the router sees. If we 
assume that all packets in a flow share the same rule, 
then the number of distinct rules passing through a given 
router varies from the worst case of O(#flows) to the 
best case of O(#destinations) seen by the router. The 
former corresponds to a destination that uses a different rule 
for every source it communicates with, the latter to a des- 
tination that uses a single rule for all potential sources. 
In our implementation, each cached authentication 
is 19 bytes — 11B for the rule identifier and an 8-byte 
hash value used to verify whether the rule has changed 
since it was authenticated. Each router uses its own 
secret hash function to prevent attackers from using 
hash collisions. Thus, one million rules would require 
only 19MB of memory. For caching entire rules, Fig. 5 
reveals average and worst-case rule sizes of 85 and 
133 bytes, respectively. If we conservatively assume 
traffic is uniformly distributed across these forwarding 
categories, we arrive at an estimated cache size of 85MB 
(average) to 133MB (worst-case) for 1M rules, which is 
within the scope of memory available in current routers. 
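
The cache-size estimates above follow directly from the per-entry sizes. The short sketch below reproduces the arithmetic; the constant and function names are illustrative and not part of the RBF implementation, and 1 MB is taken as 10^6 bytes.

```python
# Sketch: reproduce the cache-size estimates from the per-entry sizes quoted above.
# All names here are illustrative; they do not come from the RBF prototype.
AUTH_ENTRY_BYTES = 11 + 8   # 11B rule identifier + 8B hash value
AVG_RULE_BYTES = 85         # average rule size reported for Fig. 5
WORST_RULE_BYTES = 133      # worst-case rule size reported for Fig. 5


def cache_size_mb(num_rules: int, bytes_per_entry: int) -> float:
    """Return the cache footprint in MB, taking 1 MB = 10**6 bytes."""
    return num_rules * bytes_per_entry / 10**6


for label, size in [("authentication", AUTH_ENTRY_BYTES),
                    ("rule (average)", AVG_RULE_BYTES),
                    ("rule (worst case)", WORST_RULE_BYTES)]:
    print(f"{label}: {cache_size_mb(1_000_000, size):.0f} MB")
```

For one million rules this yields 19MB for cached authentications and 85MB to 133MB for cached rule descriptions, matching the figures in the text.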


7.3 Router Functions 


Our router prototype supports specialized functions 
implemented at either kernel- or user-level. We currently 
support three router functions: (i) the Snort IDS [13] 
adapted to run as a user-level function, (ii) a kernel-mode 
firewall implemented in Click and (iii) a kernel-mode 
encryption engine also implemented in Click. Each 
function runs as a separate process/kernel thread isolated 
from the packet forwarding path through queues. We 
measure performance and fairness using the above 
functions on the same hardware as before. We dedicate 
four cores to the standard forwarding path and the 
remaining four cores to custom functions. 

Fig. 7(a) illustrates the resource isolation between 
forwarding and router functions; the function used in this 
experiment is Snort (running on four cores). To generate 
traffic we use real traces of (moderately) malicious 
traffic created particularly for IDS testing [10, 30]. The 
average packet size of the trace used was 1065 bytes. To 
avoid biasing our results, we modify Snort not to drop 
any malicious packets so packets are only dropped due 
to resource exhaustion. Our test maintains constant total 
input traffic while increasing the percentage of input 
traffic that invokes Snort (X-axis). We see from Fig. 7(a) 
that Snort traffic does not affect the “regular” traffic that 
does not invoke Snort, in the sense that no regular traffic 
is dropped, even as a growing percentage of input Snort 
traffic is dropped. We observed the same isolation when 
using traces with small packets (see [42]).° 

Fig. 7(b) illustrates isolation between router functions. 
We run three experiments: (1) all traffic invokes the 
firewall function and no traffic invokes encryption; (2) all 
traffic invokes encryption; and (3) equal halves of traffic 
invoking the firewall and encryption. Fig. 7(b) plots 
the resulting forwarding rates under increasing input 
traffic. In the third (shared) test the CPU is shared fairly 
between functions (we use Click-level scheduling); thus, 
the ratio between the maximum throughputs achieved 
by each router function is expected to roughly match 
the ratio between the throughputs of the functions when 
running in isolation. In Fig. 7(b), however, the encryption 
throughput in the mixed test exceeds 50% of the throughput 
achieved when encryption runs alone, 
because the trace contains large packets. In this case, the 
CPU is not the bottleneck for the firewall functionality 
but is the bottleneck for encryption (since it is more 
CPU-intensive), and thus encryption ends up using 
the leftover firewall CPU cycles. If small packets are 
used, both functionalities achieve around 50% of their 
throughput in isolation [42]. Note that the high rates 
achieved by running each function in isolation illustrate 
the benefit of running instances of a single function on 
multiple cores (as opposed to one function per core), 
since this allows the unused resources from one function 
to be seamlessly utilized by other functions. 


7.4 RCE Load 


We use a simple back-of-the-envelope calculation to 
estimate the total number of RCE servers required for the 
Internet. The bulk of requests to RCEs are determined 
by IP address changes and per-client certifications re- 
quested by sites that protect against DoS (by redirecting 
DNS requests to powerful entities, see §3.4, §4.2). Note 
that in the latter case, requests to RCEs are made only 
for approved customers. There are currently around 700 
million hosts in the Internet [9]; given the current trend 
of smart mobile devices we consider 1 billion hosts. We 
assume a worst-case scenario in which all hosts request 
certifications in the same second; these requests are 
made either by hosts individually to certify a rule or by 


°We also measured the performance of the system with all 
the eight cores running both forwarding and Snort, and all the 
packets directed to Snort. While this configuration does not 
provide isolation for the regular traffic, it can forward a higher 
total throughput of 22Gbps. 
websites that hosts are trying to access. We implemented 
RCE rule certification in software using RSA signatures, 
and measured it on the same 8-core server used through- 
out our evaluation. We find a single server can achieve a 
certification rate of over 16,000 rules per second. Based 
on benchmarks of our implementation and assuming 
an oversubscription rate of 10x (ISPs today commonly 
oversubscribe by 100x), the total load due to certifying 
rules above could be accommodated by around 6,000 
servers; e.g., handled by 20 RCEs with 300 servers each. 
Hardware implementations might reduce this number 
by more than an order of magnitude. For example, 
using recent ECC prototypes [53, 34] a single ASIC 
could potentially perform 40,000 RCE certifications per 
second, requiring a total of only 2500 such devices. 
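
The back-of-the-envelope estimate above can be reproduced with a few lines of arithmetic. The sketch below assumes, as in the text, 1 billion hosts, a 10x oversubscription rate, and per-device certification rates of 16,000 (software) and 40,000 (ASIC) rules per second; the function name is illustrative.

```python
# Sketch of the RCE sizing estimate; the function name is an illustrative assumption.
def servers_needed(hosts: int, oversubscription: int, certs_per_sec: int) -> int:
    """Devices needed if hosts/oversubscription requests arrive in one second."""
    peak_requests_per_sec = hosts // oversubscription
    # Ceiling division: a partially loaded device still counts as one device.
    return -(-peak_requests_per_sec // certs_per_sec)


HOSTS = 1_000_000_000
software_servers = servers_needed(HOSTS, 10, 16_000)
asic_devices = servers_needed(HOSTS, 10, 40_000)
print(software_servers, asic_devices)
```

With these inputs the software estimate is 6,250 servers, which the text rounds to around 6,000, and the ASIC estimate is 2,500 devices.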


8 Related Work 


RBF is inspired by and extends several directions in past 
research. RBF’s contribution is in offering extensive flex- 
ibility while respecting policies, where prior approaches 
tended to focus on one or the other. Fig. 8 compares the 
flexibility and security features of RBF with those of pre- 
vious proposals. RBF is complementary to recent efforts 
proposing open router APIs [15, 16, 38, 12] — we offer 
an overall network design by which endpoints use the 
new functionality these router architectures promise to 
enable. This paper extends an earlier position paper [43] 
that argued the case for a rule-based architecture. 


A key feature that distinguishes RBF from previous 
proposals and allows it to achieve both flexibility and 
policy compliance is its division of functionality between 
the data and control planes. Active Networks typically 
make little use of the control plane, as they deploy the 
forwarding functionality and enforce security on the data 
plane. This makes policy compliance hard to achieve. In 
contrast, more recent proposals such as OpenFlow [12] 
rely heavily on the control plane and install flow state 
in the network to make sure the data plane respects the 
appropriate policies. This approach, while simplifying the 
data plane, results in a more rigid architecture. For example, 
supporting host mobility and traffic engineering 
requires tearing down the old paths and instantiating new 
ones. These are expensive operations which have a nega- 
tive impact on the scalability of these proposals. In con- 
trast, with RBF, each packet contains (in its rule) enough 
information to prove to routers that it respects the poli- 
cies of all participants involved in forwarding the packet. 
RBF achieves this property despite the fact that neither 
the routers nor the packet contain the policies. Thus, RBF 
retains the datagram model of IP, unlike other recent 
proposals (e.g., network capabilities [52, 51], ICING [46] 
and OpenFlow [12]), which are more akin 
to a connection-oriented model. Finally, while overlay- 
based architectures can implement more sophisticated 
data plane or control plane mechanisms, they cannot 
leverage support at routers and are thus less powerful.’ 


9 Incremental Deployment 


All the benefits of RBF shown in Fig. 8 except re- 
ceiver reachability control and DDoS protection can 
be achieved with a partial deployment of RBF routers 
and middleboxes. In an initial phase, RBF routers could 
support both RBF and legacy (non-RBF) traffic. To also 
offer DoS protection and reachability control, individual 
ASes can upgrade to RBF by dropping legacy traffic. 
Hosts in such ASes can use multihoming to handle 
legacy traffic, although they will be vulnerable to DoS 
attacks on legacy interfaces. 


10 Discussion 


We have presented RBF, an architecture we have argued 
strikes a desirable balance between flexibility and the 
ability to guarantee policy compliance of all network 
entities. We started this work with two high level goals 
in mind. First, we wanted a complete architecture that 
supports not only previously proposed communication 
primitives, but also future ones. Second, we wanted an 


"For example, overlay architectures can only drop un- 
wanted packets at overlay nodes and hence cannot create a 
network that is fundamentally default-off; once the network- 
layer address of a node is known, it can always be attacked at 
the underlying network layer. 
efficient architecture in which a packet unwanted by a 
receiver along its path is dropped as early as possible. 

While completeness in this context is difficult to for- 
malize, intuitively we have reduced it to (1) supporting 
arbitrary communication paths, and (2) allowing all 
network entities (i.e., senders, receivers, middleboxes, 
and routers) to be involved in the decision process. In 
other words, we wanted to be able to define virtually 
any forwarding path and give all involved parties a say 
in defining it. We noted that such a path can be encoded 
by associating with each node an “if-then-else” code 
snippet, which specifies the next node down the path. 
We further noted that allowing different network entities 
to define the communication pattern is equivalent to 
allowing them to define these code snippets. This is 
roughly what the RBF proposal is. 
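
To make the intuition concrete, the following minimal sketch encodes a forwarding path as one "if-then-else" snippet per node, each returning the next node. It is only an illustration of the idea described above, not RBF's rule format; the node names, packet attributes, and the `follow_path` helper are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

Packet = Dict[str, object]
# A snippet inspects the packet and names the next node, or None to drop it.
Snippet = Callable[[Packet], Optional[str]]


def follow_path(snippets: Dict[str, Snippet], start: str,
                packet: Packet, max_hops: int = 16) -> List[str]:
    """Walk the path defined by per-node snippets, starting at `start`."""
    path = [start]
    node = start
    for _ in range(max_hops):
        snippet = snippets.get(node)
        if snippet is None:      # no snippet: treat the node as the endpoint
            break
        next_node = snippet(packet)
        if next_node is None:    # the snippet chose to drop the packet
            break
        path.append(next_node)
        node = next_node
    return path


# Hypothetical snippets: the sender detours non-web traffic through a
# firewall, and the firewall forwards only packets it has allowed.
snippets: Dict[str, Snippet] = {
    "sender": lambda p: "firewall" if p["port"] != 80 else "receiver",
    "firewall": lambda p: "receiver" if p["allowed"] else None,
}

print(follow_path(snippets, "sender", {"port": 22, "allowed": True}))
```

In this toy setting, letting a different entity define the communication pattern amounts to letting it supply or replace an entry in `snippets`.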

These goals are ambitious — they subsume, unite and 
extend many years of proposals for greater flexibility and 
security in networks — and much of RBF’s complexity 
follows from these goals. 

Finally, one might question whether a relaxation of 
RBF’s goals might lead to a significantly simpler design. 
This is a valid question that we leave for future work. 
We believe that understanding the fundamental tradeoffs 
associated with these goals is critical and, at the very 
least, that RBF is a step toward arriving at such an 
understanding. 
Abstract 


Today’s smartphone operating systems frequently fail 
to provide users with adequate control over and visibility 
into how third-party applications use their private data. 
We address these shortcomings with TaintDroid, an ef- 
ficient, system-wide dynamic taint tracking and analy- 
sis system capable of simultaneously tracking multiple 
sources of sensitive data. TaintDroid provides realtime 
analysis by leveraging Android’s virtualized execution 
environment. TaintDroid incurs only 14% performance 
overhead on a CPU-bound micro-benchmark and im- 
poses negligible overhead on interactive third-party ap- 
plications. Using TaintDroid to monitor the behavior of 
30 popular third-party Android applications, we found 
68 instances of potential misuse of users’ private infor- 
mation across 20 applications. Monitoring sensitive data 
with TaintDroid provides informed use of third-party ap- 
plications for phone users and valuable input for smart- 
phone security service firms seeking to identify misbe- 
having applications. 


1 Introduction 


A key feature of modern smartphone platforms is a 
centralized service for downloading third-party applica- 
tions. The convenience to users and developers of such 
“app stores” has made mobile devices more fun and use- 
ful, and has led to an explosion of development. Apple’s 
App Store alone served nearly 3 billion applications af- 
ter only 18 months [4]. Many of these applications com- 
bine data from remote cloud services with information 
from local sensors such as a GPS receiver, camera, mi- 
crophone, and accelerometer. Applications often have le- 
gitimate reasons for accessing this privacy sensitive data, 
but users would also like assurances that their data is used 
properly. Incidents of developers relaying private infor- 
mation back to the cloud [35, 12] and the privacy risks 
posed by seemingly innocent sensors like accelerome- 
ters [19] illustrate the danger. 
Resolving the tension between the fun and utility of 
running third-party mobile applications and the privacy 
risks they pose is a critical challenge for smartphone plat- 
forms. Mobile-phone operating systems currently pro- 
vide only coarse-grained controls for regulating whether 
an application can access private information, but pro- 
vide little insight into how private information is actu- 
ally used. For example, if a user allows an application 
to access her location information, she has no way of 
knowing if the application will send her location to a 
location-based service, to advertisers, to the application 
developer, or to any other entity. This lack of transparency 
forces users to blindly trust that applications will properly 
handle their private data. 


This paper describes TaintDroid, an extension to the 
Android mobile-phone platform that tracks the flow of 
privacy sensitive data through third-party applications. 
TaintDroid assumes that downloaded, third-party appli- 
cations are not trusted, and monitors—in realtime—how 
these applications access and manipulate users’ personal 
data. Our primary goals are to detect when sensitive data 
leaves the system via untrusted applications and to facil- 
itate analysis of applications by phone users or external 
security services [33, 55]. 

Analysis of applications’ behavior requires sufficient 
contextual information about what data leaves a device 
and where it is sent. Thus, TaintDroid automatically 
labels (taints) data from privacy-sensitive sources and 
transitively applies labels as sensitive data propagates 
through program variables, files, and interprocess mes- 
sages. When tainted data are transmitted over the net- 
work, or otherwise leave the system, TaintDroid logs the 
data’s labels, the application responsible for transmitting 
the data, and the data’s destination. Such realtime feed- 
back gives users and security services greater insight into 
what mobile applications are doing, and can potentially 
identify misbehaving applications. 
To be practical, the performance overhead of the TaintDroid 
runtime must be minimal. Unlike existing solutions 
that rely on heavy-weight whole-system emulation [7, 57], 
we leveraged Android’s virtualized architecture 
to integrate four granularities of taint propagation: 
variable-level, method-level, message-level, and 
file-level. Though the individual techniques are not 
new, our contributions lie in the integration of these 
techniques and in identifying an appropriate trade-off 
between performance and accuracy for resource con- 
strained smartphones. Experiments with our prototype 
for Android show that tracking incurs a runtime over- 
head of less than 14% for a CPU-bound microbench- 
mark. More importantly, interactive third-party applica- 
tions can be monitored with negligible perceived latency. 

We evaluated the accuracy of TaintDroid using 30 ran- 
domly selected, popular Android applications that use lo- 
cation, camera, or microphone data. TaintDroid correctly 
flagged 105 instances in which these applications trans- 
mitted tainted data; of the 105, we determined that 37 
were clearly legitimate. TaintDroid also revealed that 15 
of the 30 applications reported users’ locations to remote 
advertising servers. Seven applications collected the de- 
vice ID and, in some cases, the phone number and the 
SIM card serial number. In all, two-thirds of the applica- 
tions in our study used sensitive data suspiciously. Our 
findings demonstrate that TaintDroid can help expose po- 
tential misbehavior by third-party applications. 

Like similar information-flow tracking systems [7, 
57], a fundamental limitation of TaintDroid is that it can 
be circumvented through leaks via implicit flows. The 
use of implicit flows to avoid taint detection is, in and of 
itself, an indicator of malicious intent, and may well be 
detectable through other techniques such as automated 
static code analysis [14, 46] as we discuss in Section 8. 

The rest of this paper is organized as follows: Sec- 
tion 2 provides a high-level overview of TaintDroid, Sec- 
tion 3 describes background information on the Android 
platform, Section 4 describes our TaintDroid design, 
Section 5 describes the taint sources tracked by Taint- 
Droid, Section 6 presents results from our Android ap- 
plication study, Section 7 characterizes the performance 
of our prototype implementation, Section 8 discusses the 
limitations of our approach, Section 9 describes related 
work, and Section 10 summarizes our conclusions. 


2 Approach Overview 


We seek to design a framework that allows users to 
monitor how third-party smartphone applications handle 
their private data in realtime. Many smartphone applications 
are closed-source; therefore, static source code 
analysis is infeasible. Even if source code is available, 
runtime events and configuration often dictate information 
use; realtime monitoring accounts for these 
environment-specific dependencies. 
Monitoring network disclosure of privacy sensitive in- 
formation on smartphones presents several challenges: 


• Smartphones are resource constrained. The resource 
  limitations of smartphones preclude the use of 
  heavyweight information tracking systems such as 
  Panorama [57]. 

• Third-party applications are entrusted with several 
  types of privacy sensitive information. The monitoring 
  system must distinguish multiple information types, 
  which requires additional computation and storage. 

• Context-based privacy sensitive information is dynamic 
  and can be difficult to identify even when sent in the 
  clear. For example, geographic locations are pairs of 
  floating point numbers that frequently change and are 
  hard to predict. 

• Applications can share information. Limiting the 
  monitoring system to a single application does not 
  account for flows via files and IPC between applications, 
  including core system applications designed to 
  disseminate privacy sensitive information. 


We use dynamic taint analysis [57, 44, 8, 61, 39] (also 
called “taint tracking”) to monitor privacy sensitive in- 
formation on smartphones. Sensitive information is first 
identified at a taint source, where a taint marking indi- 
cating the information type is assigned. Dynamic taint 
analysis tracks how labeled data impacts other data in a 
way that might leak the original sensitive information. 
This tracking is often performed at the instruction level. 
Finally, the impacted data is identified before it leaves 
the system at a taint sink (usually the network interface). 
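
The source, propagation, and sink steps described above can be illustrated with a minimal sketch. It tracks labels at the level of whole values rather than instructions, and every name in it (`Tainted`, `taint_source`, `propagate`, `taint_sink`, the label strings, and the destination) is a hypothetical stand-in rather than part of any existing taint tracking system.

```python
from dataclasses import dataclass
from typing import Any, Callable, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Tainted:
    """A value paired with the set of taint labels it carries."""
    value: Any
    labels: FrozenSet[str] = frozenset()


def taint_source(value: Any, label: str) -> Tainted:
    """Assign a taint marking where sensitive data enters the system."""
    return Tainted(value, frozenset({label}))


def propagate(op: Callable[..., Any], *operands: Tainted) -> Tainted:
    """Apply op; the result carries the union of the operands' labels."""
    result = op(*(o.value for o in operands))
    labels = frozenset().union(*(o.labels for o in operands))
    return Tainted(result, labels)


def taint_sink(data: Tainted, destination: str,
               log: List[Tuple[str, FrozenSet[str]]]) -> None:
    """Record labeled data that is about to leave the system."""
    if data.labels:
        log.append((destination, data.labels))


log: List[Tuple[str, FrozenSet[str]]] = []
lat = taint_source(37.42, "location")
lon = taint_source(-122.08, "location")
coords = propagate(lambda a, b: f"{a},{b}", lat, lon)
taint_sink(coords, "ads.example.com", log)            # logged: carries "location"
taint_sink(Tainted("hello"), "ads.example.com", log)  # not logged: no labels
print(log)
```

Only explicit data flows are followed here; a value derived from a tainted one through control flow alone would not inherit its labels.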

Existing taint tracking approaches have several lim- 
itations. First and foremost, approaches that rely on 
instruction-level dynamic taint analysis using whole sys- 
tem emulation [57, 7, 26] incur high performance penal- 
ties. Instruction-level instrumentation incurs 2-20 times 
slowdown [57, 7] in addition to the slowdown introduced 
by emulation, which is not suitable for realtime analysis. 
Second, developing accurate taint propagation logic has 
proven challenging for the x86 instruction set [40, 48]. 
Implementations of instruction-level tracking can experi- 
ence taint explosion if the stack pointer becomes falsely 
tainted [49] and taint loss if complicated instructions 
such as CMPXCHG and REP MOV are not instrumented 
properly [61]. While most smartphones use the ARM 
instruction set, similar false positives and false negatives 
could arise. 

Figure 1 presents our approach to taint tracking on 
smartphones. We leverage architectural features of vir- 
tual machine-based smartphones (e.g., Android, Black- 
Berry, and J2ME-based phones) to enable efficient, 
Figure 1: Multi-level approach for performance efficient 
taint tracking within a common smartphone architecture. 


system-wide taint tracking using fine-grained labels with 
clear semantics. First, we instrument the VM interpreter 
to provide variable-level tracking within untrusted application 
code.¹ Variable semantics provided by the interpreter 
give valuable context for avoiding 
the taint explosion observed in the x86 instruction set. 
Additionally, by tracking variables, we maintain taint 
markings only for data and not code. Second, we use 
message-level tracking between applications. Tracking 
taint on messages instead of data within messages mini- 
mizes IPC overhead while extending the analysis system- 
wide. Third, for system-provided native libraries, we use 
method-level tracking. Here, we run native code with- 
out instrumentation and patch the taint propagation on 
return. These methods accompany the system and have 
known information flow semantics. Finally, we use file- 
level tracking to ensure persistent information conserva- 
tively retains its taint markings. 


To assign labels, we take advantage of the well- 
defined interfaces through which applications access sen- 
sitive data. For example, all information retrieved from 
GPS hardware is location-sensitive, and all informa- 
tion retrieved from an address book database is contact- 
sensitive. This avoids relying on heuristics [10] or man- 
ual specification [61] for labels. We expand on informa- 
tion sources in Section 5. 


In order to achieve this tracking at multiple granulari- 
ties, our approach relies on the firmware’s integrity. The 
taint tracking system’s trusted computing base includes 
the virtual machine executing in userspace and any na- 
tive system libraries loaded by the untrusted interpreted 
application. However, this code is part of the firmware, 
and is therefore trusted. Applications can only escape 
the virtual machine by executing native methods. In our 
target platform (Android), we modified the native library 
loader to ensure that applications can only load native li- 
braries from the firmware and not those downloaded by 
the application. Note that an early 2010 survey of the top 
50 most popular free applications in each category of the 
Android Market [2] (1100 applications in total) revealed 
that less than 4% included a .so file. A similar survey 
conducted in mid 2010 revealed this fraction increased to 


5%, which indicates there is growth in the number of ap- 
plications using native third-party libraries, but that the 
number of affected applications remains small. 

In summary, we provide a novel, efficient, system- 
wide, multiple-marking, taint tracking design by com- 
bining multiple granularities of information tracking. 
While some techniques such as variable tracking within 
an interpreter have been previously proposed (see Sec- 
tion 9), to our knowledge, our approach is the first to 
extend such tracking system-wide. By choosing a mul- 
tiple granularity approach, we balance performance and 
accuracy. As we show in Sections 6 and 7, our system-wide 
approach is both highly efficient (~14% CPU overhead 
and ~4.4% memory overhead for simultaneously 
tracking 32 taint markings per data unit) and accurately 
detects many suspicious network packets. 


3 Background: Android 


Android [1] is a Linux-based, open source, mobile 
phone platform. Most core phone functionality is imple- 
mented as applications running on top of a customized 
middleware. The middleware itself is written in Java 
and C/C++. Applications are written in Java and compiled 
to a custom byte-code format known as Dalvik EXecutable 
(DEX). Each application executes within its own 
Dalvik VM interpreter instance. Each instance runs under a 
unique UNIX user identity to isolate applications within the 
Linux platform subsystem. Applications communicate via the 
binder IPC mechanism. 
Binder provides transparent message passing based on 
parcels. We now discuss topics necessary to understand 
our tracking system. 


Dalvik VM Interpreter: DEX is a register-based ma- 
chine language, as opposed to Java byte-code, which is 
stack-based. Each DEX method has its own predefined 
number of virtual registers (which we frequently refer to 
as simply “registers”). The Dalvik VM interpreter man- 
ages method registers with an internal execution state 
stack; the current method’s registers are always on the 
top stack frame. These registers loosely correspond to 
local variables in the Java method and store primitive 
types and object references. All computation occurs 
on registers; therefore, values must be loaded from class 
fields before use and stored back to them after use. Note that 
DEX uses class fields for all long term storage, unlike 
hardware register-based machine languages (e.g., x86), 
which store values in arbitrary memory locations. 


Native Methods: The Android middleware provides ac- 
cess to native libraries for performance optimization and 
third-party libraries such as OpenGL and Webkit. An- 
droid also uses Apache Harmony Java [3], which fre- 
quently uses system libraries (e.g., math routines). Na- 
tive methods are written in C/C++ and expose function- 
ality provided by the underlying Linux kernel and ser- 
Figure 2: TaintDroid architecture within Android. 


vices. They can also access Java internals, and hence are 
included in our trusted computing base (see Section 2). 
Android contains two types of native methods: inter- 
nal VM methods and JNI methods. The internal VM 
methods access interpreter-specific structures and APIs. 
JNI methods conform to the Java Native Interface 
specification [32], which requires Dalvik to separate 
Java arguments into variables using a JNI call bridge. 
Conversely, internal VM methods must manually parse 
arguments from the interpreter’s byte array of arguments. 


Binder IPC: All Android IPC occurs through binder. 
Binder is a component-based processing and IPC frame- 
work designed for BeOS, extended by Palm Inc., and 
customized for Android by Google. Fundamental to 
binder are parcels, which serialize both active and stan- 
dard data objects. The former includes references to 
binder objects, which allows the framework to manage 
shared data objects between processes. A binder kernel 
module passes parcel messages between processes. 


4 TaintDroid 


TaintDroid is a realization of our multiple granularity 
taint tracking approach within Android. TaintDroid uses 
variable-level tracking within the VM interpreter. Mul- 
tiple taint markings are stored as one taint tag. When 
applications execute native methods, variable taint tags 
are patched on return. Finally, taint tags are assigned 
to parcels and propagated through binder. Note that 
the Technical Report [17] version of this paper contains 
more implementation details. 

Figure 2 depicts TaintDroid’s architecture. Informa- 
tion is tainted (1) in a trusted application with sufficient 
context (e.g., the location provider). The taint inter- 
face invokes a native method (2) that interfaces with the 
Dalvik VM interpreter, storing specified taint markings 
in the virtual taint map. The Dalvik VM propagates taint 
tags (3) according to data flow rules as the trusted ap- 
plication uses the tainted information. Every interpreter 
instance simultaneously propagates taint tags. When the 
trusted application uses the tainted information in an IPC 
transaction, the modified binder library (4) ensures the 
parcel has a taint tag reflecting the combined taint mark- 
ings of all contained data. The parcel is passed transpar- 
ently through the kernel (5) and received by the remote 
untrusted application. Note that only the interpreted code 
is untrusted. The modified binder library retrieves the 
taint tag from the parcel and assigns it to all values read 
from it (6). The remote Dalvik VM instance propagates 
taint tags (7) identically for the untrusted application. 
When the untrusted application invokes a library spec- 
ified as a taint sink (8), e.g., network send, the library 
retrieves the taint tag for the data in question (9) and re- 
ports the event. 
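
Steps (4) through (6) keep one taint tag per parcel rather than one per contained value. The sketch below illustrates that message-level behavior under simplifying assumptions: tags are plain integers combined with bitwise OR, and the `Parcel` class with its `write` and `read_all` methods is a hypothetical stand-in, not the binder API.

```python
from typing import Any, List, Tuple


class Parcel:
    """Toy message container that keeps a single taint tag for the whole message."""

    def __init__(self) -> None:
        self._items: List[Any] = []
        self.tag: int = 0  # combined taint markings of all contained data

    def write(self, value: Any, tag: int = 0) -> None:
        """Sender side: append a value and fold its tag into the parcel tag."""
        self._items.append(value)
        self.tag |= tag

    def read_all(self) -> List[Tuple[Any, int]]:
        """Receiver side: every value read inherits the parcel's tag."""
        return [(value, self.tag) for value in self._items]


LOCATION = 0x1  # hypothetical marking values
CONTACTS = 0x2

parcel = Parcel()
parcel.write("37.42,-122.08", LOCATION)
parcel.write("Alice", CONTACTS)
parcel.write("app version 1.0")  # untainted when written

for value, tag in parcel.read_all():
    print(value, hex(tag))
```

Note that the untainted string is read back with both markings set: tracking at message granularity trades some precision for lower IPC overhead.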

Implementing this architecture requires addressing 
several system challenges, including: a) taint tag stor- 
age, b) interpreted code taint propagation, c) native code 
taint propagation, d) IPC taint propagation, and e) sec- 
ondary storage taint propagation. The remainder of this 
section describes our design. 


4.1 Taint Tag Storage 


The choice of how to store taint tags influences per- 
formance and memory overhead. Dynamic taint track- 
ing systems commonly store tags for every data byte or 
word [57, 7]. Tracked memory is unstructured and with- 
out content semantics. Frequently taint tags are stored 
in non-adjacent shadow memory [57] and tag maps [61]. 
TaintDroid uses variable semantics within the Dalvik in- 
terpreter. We store taint tags adjacent to variables in 
memory, providing spatial locality. 

Dalvik has five variable types that require taint stor- 
age: method local variables, method arguments, class 
static fields, class instance fields, and arrays. In all cases, 
we store a 32-bit bitvector with each variable to encode 
the taint tag, allowing 32 different taint markings. 
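
To make the encoding concrete, the following minimal Java sketch models a taint tag as a 32-bit vector with one bit per taint marking. Only the 32-bit width and the one-bit-per-marking encoding come from the text above; the class name, the marking names, and their bit positions are illustrative assumptions and are not taken from the TaintDroid source.

```java
// Hypothetical sketch: a taint tag as a 32-bit vector, one bit per marking.
// Marking names and bit positions are assumptions made for illustration.
final class TaintTagSketch {
    static final int TAINT_CLEAR    = 0;       // empty tag: no markings
    static final int TAINT_LOCATION = 1 << 0;  // assumed bit assignment
    static final int TAINT_IMEI     = 1 << 1;  // assumed bit assignment
    static final int TAINT_MIC      = 1 << 2;  // assumed bit assignment

    // The set union of two tags is a bitwise OR.
    static int union(int a, int b) {
        return a | b;
    }

    // True if the tag carries the given marking.
    static boolean has(int tag, int marking) {
        return (tag & marking) != 0;
    }

    // Number of distinct markings present in a tag (at most 32).
    static int count(int tag) {
        return Integer.bitCount(tag);
    }
}
```

Under this encoding, combining the tags of two operands is a single bitwise OR, and a tag with every bit set carries all 32 markings.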

Dalvik stores method local variables and arguments
on an internal stack. When an application invokes a
method, a new stack frame is allocated for all local
variables. Method arguments are also passed via the
internal stack. Before calling a method, the caller places
the arguments on the top of the stack such that they
become high numbered registers in the callee’s stack
frame. We allocate taint tag storage by doubling the size
of the stack frame allocation. Taint tags are interleaved
between values such that register $v_i$, originally accessed
via $fp[i]$, is accessed as $fp[2 \cdot i]$ after modification.
Note that Dalvik stores 64-bit variables as two adjacent
32-bit registers on the internal stack. While the byte-code
interprets these adjacent registers as a single 64-bit value,
the interpreter manages these registers as separate values.
Therefore, our modified stack transparently stores and
retrieves 64-bit values to and from separate 32-bit registers
(at $fp[2 \cdot i]$ and $fp[2 \cdot i + 2]$). Finally, native
method targets require a slightly different stack frame
organization for reasons discussed in Section 4.3. The
modified stack format is shown in Figure 3.


Figure 3: Modified Stack Format. Taint tags are interleaved
between registers for interpreted method targets and
appended for native methods. Dark grayed boxes represent
taint tags.
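
The interleaved layout for interpreted method targets can be sketched as follows. This is a simplified model, not the Dalvik implementation: it assumes the tag of register $v_i$ sits in the slot immediately after its value (index $2 \cdot i + 1$), that the low half of a 64-bit value is stored first, and that both halves of a 64-bit value carry the same tag. The class and method names are hypothetical.

```java
// Hypothetical model of the interleaved frame: slot 2*i holds the value of
// register v_i; slot 2*i + 1 is assumed to hold its taint tag.
final class InterleavedFrameSketch {
    private final int[] fp;  // doubled allocation: values and tags interleaved

    InterleavedFrameSketch(int registerCount) {
        fp = new int[2 * registerCount];
    }

    void setRegister(int i, int value, int tag) {
        fp[2 * i] = value;
        fp[2 * i + 1] = tag;
    }

    int getRegister(int i) { return fp[2 * i]; }

    int getTaint(int i) { return fp[2 * i + 1]; }

    // A 64-bit value spans registers v_i and v_{i+1}, i.e. slots 2*i and 2*i + 2.
    void setWide(int i, long value, int tag) {
        setRegister(i, (int) value, tag);               // low 32 bits (assumed order)
        setRegister(i + 1, (int) (value >>> 32), tag);  // high 32 bits
    }

    long getWide(int i) {
        long low = fp[2 * i] & 0xFFFFFFFFL;
        long high = fp[2 * i + 2] & 0xFFFFFFFFL;
        return (high << 32) | low;
    }

    int slotCount() { return fp.length; }
}
```

In this model a frame with four registers occupies eight slots, and the two halves of a 64-bit value are separated by one tag slot, which matches the $fp[2 \cdot i]$ and $fp[2 \cdot i + 2]$ indices given above.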


Taint tags are stored adjacent to class fields and ar- 
rays inside the VM interpreter’s internal data structures. 
TaintDroid stores only one taint tag per array to minimize 
storage overhead. Per-value taint tag storage is severely 
inefficient for Java String objects, as all characters have 
the same tag. Unfortunately, storing one taint tag per ar- 
ray may result in false positives during taint propagation. 
For example, if untainted variable u is stored into array A
at index 0 (A[0]) and tainted variable t is stored into A[1],
then array A is tainted. Later, if variable v is assigned
the value of A[0], v will be tainted, even though u was
untainted. Fortunately, Java frequently uses objects, and
object references are infrequently tainted (see Section 4.2);
therefore, this coding practice leads to fewer false positives.
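
The false positive described above can be reproduced with a small model of an array that carries a single taint tag. The sketch below is an illustration under simplifying assumptions: the class is hypothetical, only int arrays are modeled, and the taint of the index register is ignored.

```java
// Hypothetical model of an int array with one taint tag for the whole array.
final class TaintedIntArraySketch {
    private final int[] values;
    private int tag = 0;  // single tag shared by all elements

    TaintedIntArraySketch(int length) {
        values = new int[length];
    }

    // Storing a value folds its tag into the array tag.
    void put(int index, int value, int valueTag) {
        values[index] = value;
        tag |= valueTag;
    }

    int get(int index) { return values[index]; }

    // Any element read reports the whole-array tag, whatever was stored there.
    int tagOf(int index) { return tag; }
}
```

Storing an untainted value at index 0 and a tainted value at index 1 leaves `tagOf(0)` non-empty, so a later read of A[0] is reported as tainted even though the stored value was not.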


4.2 Interpreted Code Taint Propagation 


Taint tracking granularity and flow semantics influ- 
ence performance and accuracy. TaintDroid implements 
variable-level taint tracking within the Dalvik VM interpreter.
Variables provide valuable semantics for taint propagation,
distinguishing data pointers from scalar values.
TaintDroid primarily tracks primitive type variables
(e.g., int and float); however, there are cases when object
references must become tainted to ensure taint propagation
operates correctly. This section addresses why these
cases exist. First, we present taint tracking in
the Dalvik machine language as a formal logic.


4.2.1 Taint Propagation Logic 


The Dalvik VM operates on the unique DEX machine 
language instruction set, therefore we must design an ap- 
propriate propagation logic. We use a data flow logic, as 
tracking implicit flows requires static analysis and causes 
significant performance overhead and overestimation in 
tracking [29] (see Section 8). We begin by defining taint 
markings, taint tags, variables, and taint propagation. We 
then present our logic rules for DEX. 

Let $\mathcal{L}$ be the universe of taint markings for a particular
system. A taint tag $t$ is a set of taint markings, $t \subseteq \mathcal{L}$.
Each variable has an associated taint tag. A variable is an
instance of one of the five types described in Section 4.1.
We use a different representation for each type. The local
and argument variables correspond to virtual registers,
denoted $v_x$. Class field variables are denoted as $f_x$ to
indicate a field variable with class index $x$. Instance fields
require an instance object and are denoted $v_y(f_x)$, where
$v_y$ is the instance object reference (note that both the
object reference and the dereferenced value are variables).
Static fields are denoted as $f_x$ alone, which is shorthand
for $S(f_x)$, where $S()$ is the static scope. Finally, $v_x[\cdot]$
denotes an array, where $v_x$ is an array object reference
variable.

Our virtual taint map function is $\tau(\cdot)$. $\tau(v)$ returns the
taint tag $t$ for variable $v$. $\tau(v)$ is also used to assign a
taint tag to a variable. Retrieval and assignment are
distinguished by the position of $\tau(\cdot)$ w.r.t. the $\leftarrow$ symbol.
When $\tau(v)$ appears on the right hand side of $\leftarrow$, $\tau(v)$
retrieves the taint tag for $v$. When $\tau(v)$ appears on the left
hand side, $\tau(v)$ assigns the taint tag for $v$. For example,
$\tau(v_1) \leftarrow \tau(v_2)$ copies the taint tag from $v_2$ to $v_1$.

Table 1 captures our propagation logic. The table enu- 
merates abstracted versions of the byte-code instructions 
specified in the DEX documentation. Register variables 
and class fields are referenced by $v_X$ and $f_X$, respectively.
$R$ and $E$ are the return and exception variables
maintained within the interpreter, respectively. A, B, and
C are constants in the byte-code. The table does not list 
instructions that clear the taint tag of the destination reg- 
ister. For example, we do not consider the array-length 
instruction to return a tainted value even if the array is 
tainted. Note that the array length is sometimes used to 
aid direct control flow propagation (e.g., Vogt et al. [53]). 
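
A few of these rules can be expressed as a small register machine that keeps one taint tag per register. The sketch below is a simplified illustration, not TaintDroid's interpreter: it assumes tags are 32-bit vectors combined with bitwise OR, uses addition as the only binary operator, and all class and method names are hypothetical.

```java
// Hypothetical register machine applying a subset of the propagation rules.
final class DexTaintRulesSketch {
    final int[] reg;    // register values v_0 .. v_{n-1}
    final int[] taint;  // tau(v_i): one taint tag per register

    DexTaintRulesSketch(int n) {
        reg = new int[n];
        taint = new int[n];
    }

    // const-op vA C:  vA <- C,  tau(vA) <- empty set
    void constOp(int a, int c) {
        reg[a] = c;
        taint[a] = 0;
    }

    // move-op vA vB:  vA <- vB,  tau(vA) <- tau(vB)
    void moveOp(int a, int b) {
        reg[a] = reg[b];
        taint[a] = taint[b];
    }

    // binary-op vA vB vC (addition):  tau(vA) <- tau(vB) U tau(vC)
    void addOp(int a, int b, int c) {
        int t = taint[b] | taint[c];  // computed first so a may alias b or c
        reg[a] = reg[b] + reg[c];
        taint[a] = t;
    }

    // binary-op vA vB C (add literal):  tau(vA) <- tau(vB)
    void addLitOp(int a, int b, int c) {
        int t = taint[b];
        reg[a] = reg[b] + c;
        taint[a] = t;
    }
}
```

Because the literal in the last rule is a byte-code constant, it contributes no taint, whereas the three-register form unions the tags of both source registers.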


4.2.2 Tainting Object References 


The propagation rules in Table 1 are straightforward
with two exceptions. First, taint propagation logics
commonly include the taint tag of an array index during
lookup to handle translation tables (e.g., ASCII/UNICODE
or character case conversion). For example, consider a
translation table from lowercase to uppercase characters:
if a tainted value “a” is used as an array index, the
resulting “A” value should be tainted even though the
“A” value in the array is not. Hence, the taint logic for
aget-op uses both the array and array index taint. Second,
when the array contains object references (e.g., an
Integer array), the index taint tag is propagated to the
object reference and not the object value. Therefore, we
include the object reference taint tag in the instance get
(iget-op) rule.


Table 1: DEX Taint Propagation Logic. Register variables and class fields are referenced by $v_X$ and $f_X$, respectively.
$R$ and $E$ are the return and exception variables maintained within the interpreter. $A$, $B$, and $C$ are byte-code constants.

Op Format | Op Semantics | Taint Propagation | Description
const-op $v_A$ $C$ | $v_A \leftarrow C$ | $\tau(v_A) \leftarrow \emptyset$ | Clear $v_A$ taint
move-op $v_A$ $v_B$ | $v_A \leftarrow v_B$ | $\tau(v_A) \leftarrow \tau(v_B)$ | Set $v_A$ taint to $v_B$ taint
move-op-R $v_A$ | $v_A \leftarrow R$ | $\tau(v_A) \leftarrow \tau(R)$ | Set $v_A$ taint to return taint
return-op $v_A$ | $R \leftarrow v_A$ | $\tau(R) \leftarrow \tau(v_A)$ | Set return taint ($\emptyset$ if void)
move-op-E $v_A$ | $v_A \leftarrow E$ | $\tau(v_A) \leftarrow \tau(E)$ | Set $v_A$ taint to exception taint
throw-op $v_A$ | $E \leftarrow v_A$ | $\tau(E) \leftarrow \tau(v_A)$ | Set exception taint
unary-op $v_A$ $v_B$ | $v_A \leftarrow \otimes v_B$ | $\tau(v_A) \leftarrow \tau(v_B)$ | Set $v_A$ taint to $v_B$ taint
binary-op $v_A$ $v_B$ $v_C$ | $v_A \leftarrow v_B \otimes v_C$ | $\tau(v_A) \leftarrow \tau(v_B) \cup \tau(v_C)$ | Set $v_A$ taint to $v_B$ taint $\cup$ $v_C$ taint
binary-op $v_A$ $v_B$ | $v_A \leftarrow v_A \otimes v_B$ | $\tau(v_A) \leftarrow \tau(v_A) \cup \tau(v_B)$ | Update $v_A$ taint with $v_B$ taint
binary-op $v_A$ $v_B$ $C$ | $v_A \leftarrow v_B \otimes C$ | $\tau(v_A) \leftarrow \tau(v_B)$ | Set $v_A$ taint to $v_B$ taint
aput-op $v_A$ $v_B$ $v_C$ | $v_B[v_C] \leftarrow v_A$ | $\tau(v_B[\cdot]) \leftarrow \tau(v_B[\cdot]) \cup \tau(v_A)$ | Update array $v_B$ taint with $v_A$ taint
aget-op $v_A$ $v_B$ $v_C$ | $v_A \leftarrow v_B[v_C]$ | $\tau(v_A) \leftarrow \tau(v_B[\cdot]) \cup \tau(v_C)$ | Set $v_A$ taint to array and index taint
sput-op $v_A$ $f_B$ | $f_B \leftarrow v_A$ | $\tau(f_B) \leftarrow \tau(v_A)$ | Set field $f_B$ taint to $v_A$ taint
sget-op $v_A$ $f_B$ | $v_A \leftarrow f_B$ | $\tau(v_A) \leftarrow \tau(f_B)$ | Set $v_A$ taint to field $f_B$ taint
iput-op $v_A$ $v_B$ $f_C$ | $v_B(f_C) \leftarrow v_A$ | $\tau(v_B(f_C)) \leftarrow \tau(v_A)$ | Set field $f_C$ taint to $v_A$ taint
iget-op $v_A$ $v_B$ $f_C$ | $v_A \leftarrow v_B(f_C)$ | $\tau(v_A) \leftarrow \tau(v_B(f_C)) \cup \tau(v_B)$ | Set $v_A$ taint to field $f_C$ and object reference taint


public static Integer valueOf(int i) {
    if (i < -128 || i > 127) {
        return new Integer(i); }
    return valueOfCache.CACHE[i+128];
}
static class valueOfCache {
    static final Integer[] CACHE = new Integer[256];
    static {
        for(int i=-128; i<=127; i++) {
            CACHE[i+128] = new Integer(i); } }
}

Figure 4: Excerpt from Android’s Integer class illustrating
the need for object reference taint propagation.

The code listed in Figure 4 demonstrates a real in- 
stance of where object reference tainting is needed. Here, 
valueOf() returns an Integer object for a passed int. If the 
int argument is between -128 and 127, valueOf() returns a
reference to a statically defined Integer object. valueOf() 
is implicitly called for conversion to an object. Consider 
the following definition and use of a method intProxy(). 


Object intProxy(int val) { return val; } 
int out = (Integer) intProxy(tVal); 


Consider the case where tVal is an int with value 1 
and taint tag TAG. When intProxy() is passed tVal, TAG 
is propagated to val. When intProxy() returns val, it 
calls Integer.valueOf() to obtain an Integer instance cor- 
responding to the scalar variable val. In this case, Inte- 
ger.valueOf() returns a reference to the static Integer ob- 
ject with value 1. The value field (of the Integer class) in 
the object has a taint tag of $\emptyset$; however, since the aget-op 
propagation rule includes the taint of the index register, 
the object reference has a taint tag of TAG. Therefore, 
only by including the object reference taint tag when the 
value field is read from the Integer (i.e., the iget-op prop- 
agation rule), will the correct taint tag of TAG be assigned 
to out. 
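
The intProxy() scenario can be traced with a small model of the tag computations involved. The sketch below is illustrative only: `Box` stands in for a cached Integer, tags are assumed to be 32-bit vectors, the cache array itself is assumed untainted, and none of the names correspond to TaintDroid code.

```java
// Hypothetical trace of the aget-op and iget-op rules for the Integer cache.
final class RefTaintSketch {
    // Stand-in for a cached Integer: a value field with its own taint tag.
    static final class Box {
        final int value;
        final int valueTag;
        Box(int value, int valueTag) { this.value = value; this.valueTag = valueTag; }
    }

    static final Box[] CACHE = new Box[256];
    static final int CACHE_ARRAY_TAG = 0;  // assumption: the cache array is untainted

    static {
        for (int i = -128; i <= 127; i++) {
            CACHE[i + 128] = new Box(i, 0);  // cached value fields are untainted
        }
    }

    // aget-op: tau(ref) <- tau(array) U tau(index)
    static int agetRefTag(int indexTag) {
        return CACHE_ARRAY_TAG | indexTag;
    }

    // iget-op as in Table 1: tau(vA) <- tau(vB(fC)) U tau(vB)
    static int igetTag(Box ref, int refTag) {
        return ref.valueTag | refTag;
    }

    // iget-op without the object reference taint, for comparison.
    static int igetTagIgnoringRef(Box ref) {
        return ref.valueTag;
    }
}
```

With a tainted index, `agetRefTag` yields a tainted reference; `igetTag` then returns TAG for the value read from the cached object, while `igetTagIgnoringRef` returns the empty tag and the marking would be lost.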


4.3 Native Code Taint Propagation 


Native code is unmonitored in TaintDroid. Ideally, 
we achieve the same propagation semantics as the in- 
terpreted counterpart. Hence, we define two necessary 
postconditions for accurate taint tracking in the Java- 
like environment: 1) all accessed external variables (i.e., 
class fields referenced by other methods) are assigned 
taint tags according to data flow rules; and 2) the return
value is assigned a taint tag according to data flow
rules. TaintDroid achieves these postconditions through 
an assortment of manual instrumentation, heuristics, and 
method profiles, depending on situational requirements. 


Internal VM Methods: Internal VM methods are called 
directly by interpreted code, passing a pointer to an ar- 
ray of 32-bit register arguments and a pointer to a return 
value. The stack augmentation shown in Figure 3 pro- 
vides access to taint tags for both Java arguments and 
the return value. As there are a relatively small number 
of internal VM methods which are infrequently added 
between versions, we manually inspected and patched 
them for taint propagation as needed. We identified 185 
internal VM methods in Android version 2.1; however, 
only 5 required patching: the System.arraycopy() native 
method for copying array contents, and several native 
methods implementing Java reflection. 


JNI Methods: JNI methods are invoked through the 
JNI call bridge. The call bridge parses Java arguments 
and assigns a return value using the method’s descriptor 
string. We patched the call bridge to provide taint propagation
for all JNI methods. When a JNI method returns, 
TaintDroid consults a method profile table for tag propa- 
gation updates. A method profile is a list of (from, to) 
pairs indicating flows between variables, which may be 
method parameters, class variables, or return values. 
Enumerating the information flows for all JNI methods 
is a time consuming task best completed automatically 
using source code analysis (a task we leave for future 
work). We currently include an additional propagation 
heuristic patch. The heuristic is conservative for JNI 
methods that only operate on primitive and String ar- 
guments and return values. It assigns the union of the 
method argument taint tags to the taint tag of the return 
value. While the heuristic has false negatives for meth- 
ods using objects, it covers many existing methods. 
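
The return-value heuristic and the application of a method profile can be sketched as follows. This is a minimal illustration under stated assumptions: tags are 32-bit vectors, a profile entry merges the source tag into the destination tag with a union, and variables are identified by indices into a hypothetical tag table. The actual profile representation used by TaintDroid is not described here.

```java
import java.util.List;

// Hypothetical sketch of JNI taint propagation after a native method returns.
final class JniTaintSketch {
    // Heuristic: the return-value tag is the union of all argument tags.
    static int heuristicReturnTag(int... argTags) {
        int tag = 0;
        for (int t : argTags) {
            tag |= t;
        }
        return tag;
    }

    // One (from, to) pair of a method profile; the indices refer to an
    // assumed table of variable tags.
    static final class Flow {
        final int from;
        final int to;
        Flow(int from, int to) { this.from = from; this.to = to; }
    }

    // Apply a profile: each flow merges the source tag into the destination.
    static void applyProfile(List<Flow> profile, int[] tags) {
        for (Flow f : profile) {
            tags[f.to] |= tags[f.from];
        }
    }
}
```

For a method with argument tags 1, 0, and 4, the heuristic assigns the return value the tag 5. A profile containing the single pair (0, 2) copies the markings of variable 0 into variable 2 and leaves other variables untouched.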

We performed a survey of the JNI methods included 
in the official Android source code (version 2.1) to determine
specific properties. We found 2,844 JNI methods
with a Java interface and C or C++ implementation. 
Of these methods, 913 did not reference objects (as argu- 
ments, return value, or method body) and hence are auto- 
matically covered by our heuristic. The remaining meth- 
ods may or may not have information flows that produce 
false negatives. Currently, we define method profiles as 
needed. For example, methods in the IBM NativeCon- 
verter class require propagation for conversion between 
character and byte arrays. 


4.4 IPC Taint Propagation 


Taint tags must propagate between applications when 
they exchange data. The tracking granularity affects 
performance and memory overhead. TaintDroid uses 
message-level taint tracking. A message taint tag repre- 
sents the upper bound of taint markings assigned to vari- 
ables contained in the message. We use message-level 
granularity to minimize performance and storage over- 
head during IPC. 

We chose to implement message-level over variable- 
level taint propagation, because in a variable-level sys- 
tem, a devious receiver could game the monitoring by 
unpacking variables in a different way to acquire val- 
ues without taint propagation. For example, if an IPC 
parcel message contains a sequence of scalar values, the 
receiver may unpack a string instead, thereby acquiring 
values without propagating all the taint tags on scalar val- 
ues in the sequence. Hence, to prevent applications from 
removing taint tags in this way, the current implementa- 
tion protects taint tags at the message-level. 

Message-level taint propagation for IPC leads to false 
positives. Similar to arrays, all data items in a parcel 
share the same taint tag. For example, Section 8 discusses
limitations for tracking the IMSI that result from passing
portions of the value as configuration parameters in
parcels. Future implementations will consider word-level
taint tags along with additional consistency checks 
to ensure accurate propagation for unpacked variables. 
However, this additional complexity will negatively im- 
pact IPC performance. 
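
Message-level tracking can be illustrated with a simplified parcel model. The class below is hypothetical and is not Android's Parcel API; it holds only int values and assumes tags are 32-bit vectors. It shows the two properties described above: the message tag is the union of the tags of everything written, and every value read receives that tag regardless of how the receiver unpacks the data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parcel with one taint tag for the whole message.
final class TaintedParcelSketch {
    private final List<Integer> data = new ArrayList<>();
    private int tag = 0;      // upper bound of the tags of all contained values
    private int readPos = 0;

    // Writing a value folds its tag into the message tag.
    void writeInt(int value, int valueTag) {
        data.add(value);
        tag |= valueTag;
    }

    int readInt() {
        return data.get(readPos++);
    }

    // Every value read from the parcel is assigned the message tag.
    int tagForReads() {
        return tag;
    }
}
```

Writing an untainted 42 followed by a tainted 7 produces a parcel whose reads are all tainted, including the read of 42. This is the source of the false positives noted above, and also the reason a receiver cannot shed taint by unpacking the contents differently.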


4.5 Secondary Storage Taint Propagation 


Taint tags may be lost when data is written to a file. 
Our design stores one taint tag per file. The taint tag 
is updated on file write and propagated to data on file 
read. TaintDroid stores file taint tags in the file sys- 
tem’s extended attributes. To do this, we implemented 
extended attribute support for Android’s host file system 
(YAFFS2) and formatted the removable SDcard with the 
ext2 file system. As with arrays and IPC, storing one 
taint tag per file leads to false positives and limits the 
granularity of taint markings for information databases 
(see Section 5). Alternatively, we could track taint tags 
at a finer granularity at the expense of added memory and 
performance overhead. 


4.6 Taint Interface Library 


Taint sources and sinks defined within the virtualized 
environment must communicate taint tags with the track- 
ing system. We abstract the taint source and sink logic 
into a single taint interface library. The interface per- 
forms two functions: 1) add taint markings to variables; 
and 2) retrieve taint markings from variables. The library 
only provides the ability to add and not set or clear taint 
tags, as such functionality could be used by untrusted 
Java code to remove taint markings. 

Adding taint tags to arrays and strings via internal VM 
methods is straightforward, as both are stored in data ob- 
jects. Primitive type variables, on the other hand, are 
stored on the interpreter’s internal stack and disappear 
after a method is called. Therefore, the taint library uses 
the method return value as a means of tainting primitive 
type variables. The developer passes a value or variable 
into the appropriate add taint method (e.g., addTaintInt()) 
and the returned variable has the same value but addition- 
ally has the specified taint tag. Note that the stack storage 
does not pose complications for taint tag retrieval. 
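
The add-only behavior of the interface can be sketched in plain Java. Standard Java cannot attach a tag to a primitive, so the sketch uses a hypothetical `TInt` pair to stand in for an int and the tag the VM would keep beside it. The method name addTaintInt() comes from the text above; `getTaintInt` and everything else are assumptions made for illustration.

```java
// Hypothetical model of the add-only taint interface for int values.
final class TaintInterfaceSketch {
    // Stand-in for a primitive int plus the tag stored next to it by the VM.
    static final class TInt {
        final int value;
        final int tag;
        TInt(int value, int tag) { this.value = value; this.tag = tag; }
    }

    // The returned variable has the same value, keeps its existing markings,
    // and additionally carries the requested ones.
    static TInt addTaintInt(TInt in, int markings) {
        return new TInt(in.value, in.tag | markings);
    }

    // Retrieval only reads the tag; there is no method to set or clear it.
    static int getTaintInt(TInt in) {
        return in.tag;
    }
}
```

Because the only mutation is a union, calling `addTaintInt` with an empty tag leaves the existing markings in place, so code using this interface has no way to remove taint.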


5 Privacy Hook Placement 


Using TaintDroid for privacy analysis requires iden- 
tifying privacy sensitive sources and instrumenting taint 
sources within the operating system. Historically, dy- 
namic taint analysis systems assume taint source and sink 
placement is trivial. However, complex operating sys- 
tems such as Android provide applications information 
in a variety of ways, e.g., direct access, and service inter- 
face. Each potential type of privacy sensitive information 
must be studied carefully to determine the best method of 
defining the taint source. 

Taint sources can only add taint tags to memory for 
which TaintDroid provides tag storage. Currently, taint 
source and sink placement is limited to variables in in- 
terpreted code, IPC messages, and files. This section 
discusses how valuable taint sources and sinks can be im- 
plemented within these restrictions. We generalize such 
taint sources based on information characteristics. 


Low-bandwidth Sensors: A variety of privacy sensitive 
information types are acquired through low-bandwidth 
sensors, e.g., location and accelerometer. Such informa- 
tion often changes frequently and is simultaneously used 
by multiple applications. Therefore, it is common for 
a smartphone OS to multiplex access to low-bandwidth 
sensors using a manager. This sensor manager represents 
an ideal point for taint source hook placement. For our 
analysis, we placed hooks in Android’s LocationMan- 
ager and SensorManager applications. 


High-bandwidth Sensors: Privacy sensitive informa- 
tion sources such as the microphone and camera are 
high-bandwidth. Each request from the sensor frequently 
returns a large amount of data that is only used by one 
application. Therefore, the smartphone OS may share 
sensor information via large data buffers, files, or both. 
When sensor information is shared via files, the file must 
be tainted with the appropriate tag. Due to flexible APIs, 
we placed hooks for both data buffer and file tainting for 
tracking microphone and camera information. 


Information Databases: Shared information such as address
books and SMS messages is often stored in file-based
databases. This organization provides a useful, unambiguous
taint source similar to hardware sensors. By 
adding a taint tag to such database files, all informa- 
tion read from the file will be automatically tainted. We 
used this technique for tracking address book informa- 
tion. Note that while TaintDroid’s file-level granularity 
was appropriate for these valuable information sources, 
others may exist for which files are too coarse grained. 
However, we have not yet encountered such sources. 


Device Identifiers: Information that uniquely identifies 
the phone or the user is privacy sensitive. Not all per- 
sonally identifiable information can be easily tainted. 
However, the phone contains several easily tainted iden- 
tifiers: the phone number, SIM card identifiers (IMSI, 
ICC-ID), and device identifier (IMEI) are all accessed 
through well-defined APIs. We instrumented the APIs 
for the phone number, ICC-ID, and IMEI. An IMSI taint 
source has inherent limitations discussed in Section 8. 


Network Taint Sink: Our privacy analysis identifies
when tainted information is transmitted out of the network
interface. The VM interpreter-based approach requires the 
taint sink to be placed within interpreted code. Hence, 
we instrumented the Java framework libraries at the point 
the native socket library is invoked. 


6 Application Study 


This section reports on an application study that uses 
TaintDroid to analyze how 30 popular third-party An- 
droid applications use privacy sensitive user data. Exist- 
ing applications acquire a variety of user data along with 
permissions to access the Internet. Our study finds that 
two thirds of these applications expose detailed location 
data, the phone’s unique ID, and the phone number using 
the combination of the seemingly innocuous access per- 
missions granted at install. This finding was made possi- 
ble by TaintDroid’s ability to monitor runtime access of 
sensitive user data and to precisely relate the monitored 
accesses with the data exposure by applications. 


6.1 Experimental Setup 


An early 2010 survey of the 50 most popular free ap- 
plications in each category of the Android Market [2] 
(1,100 applications, in total) revealed that roughly a third 
of the applications (358 of the 1,100 applications) re- 
quire Internet permissions along with permissions to ac- 
cess either location, camera, or audio data. From this set, 
we randomly selected 30 popular applications (an 8.4% 
sample size), which span twelve categories. Table 2 enu- 
merates these applications along with permissions they 
request at install time. Note that this does not reflect ac- 
tual access or use of sensitive data. 


We studied each of the thirty downloaded applica- 
tions by starting the application, performing any initial- 
ization or registration that was required, and then man- 
ually exercising the functionality offered by the appli- 
cation. We recorded system logs including detailed in- 
formation from TaintDroid: tainted binder messages, 
tainted file output, and tainted network messages with 
the remote address. The overall experiment (conducted 
in May 2010) lasted slightly over 100 minutes, generat- 
ing 22,594 packets (8.6MB) and 1,130 TCP connections. 
To verify our results, we also logged the network traffic 
using tcpdump on the WiFi interface and repeated exper- 
iments on multiple Nexus One phones, running the same 
version of TaintDroid built on Android 2.1. Though the 
phones used for experiments had a valid SIM card installed,
the SIM card was inactive, forcing all the packets
to be transmitted via the WiFi interface. The packet 
trace was used only to verify the exposure of tainted data 
flagged by TaintDroid. 


In addition to the network trace, we also noted whether 
applications acquired user consent (either explicit or im- 
plicit) for exporting sensitive information. This provides 
additional context information to identify possible pri- 
vacy violations. For example, by selecting the “use my 
location” option in a weather application, the user im- 
plicitly consents to disclosing geographic coordinates to 
the weather server. 

Table 2: Applications grouped by the requested permissions (L: location, C: camera, A: audio, P: phone state). Android
Market categories are indicated in parentheses, showing the diversity of the studied applications.

Applications* | # | Permissions (L C A P)
The Weather Channel (News & Weather); Cestos, Solitaire (Game); Movies (Entertainment); Babble (Social); Manga Browser (Comics) | 6 | x
Bump, Wertago (Social); Antivirus (Communication); ABC — Animals, Traffic Jam, Hearts, Blackjack (Games); Horoscope (Lifestyle); 3001 Wisdom Quotes Lite, Yellow Pages (Reference); Dastelefonbuch, Astrid (Productivity); BBC News Live Stream (News & Weather); Ringtones (Entertainment) | 14 | x x
Layer (Productivity); Knocking (Social); Barcode Scanner, Coupons (Shopping); Trapster (Travel); Spongebob Slide (Game); ProBasketBall (Sports) | 7 | x x x
MySpace (Social); ixMAT (Shopping) | 2 |
Evernote (Productivity) | 1 |

* All listed applications also require access to the Internet. 


Table 3: Potential privacy violations by 20 of the studied applications. Note that three applications had multiple
violations, one of which had a violation in all three categories.

Observed Behavior (# of apps) | Details
Phone Information to Content Servers (2) | 2 apps sent out the phone number, IMSI, and ICC-ID along with the geo-coordinates to the app’s content server.
Device ID to Content Servers (7)* | 2 Social, 1 Shopping, 1 Reference and three other apps transmitted the IMEI number to the app’s content server.
Location to Advertisement Servers (15) | 5 apps sent geo-coordinates to ad.qwapi.com, 5 apps to admob.com, 2 apps to ads.mobclix.com (1 sent location both to admob.com and ads.mobclix.com) and 4 apps sent location† to data.flurry.com.

* TaintDroid flagged nine applications in this category, but only seven transmitted the raw IMEI without mentioning such practice in the EULA.

† To the best of our knowledge, the binary messages contained tainted location data (see the discussion below). 


6.2 Findings 


Table 3 summarizes our findings. TaintDroid flagged 
105 TCP connections as containing tainted privacy sen- 
sitive information. We manually labeled each mes- 
sage based on available context, including remote server 
names and temporally relevant application log messages. 
We used remote hostnames as an indication of whether 
data was being sent to a server providing application 
functionality or to a third party. Frequently, messages 
contained plaintext that aided categorization, e.g., an 
HTTP GET request containing geographic coordinates. 
However, 21 flagged messages contained binary data. 
Our investigation indicates these messages were gen- 
erated by the Google Maps for Mobile [21] and Flur- 
ryAgent [20] APIs and contained tainted privacy sensi- 
tive data. These conclusions are supported by message 
transmissions immediately after the application received 
a tainted parcel from the system location manager. We 
now expand on our findings for each category and reflect 
on potential privacy violations. 


Phone Information: Table 2 shows that 21 out of the 
30 applications require permissions to read phone state 
and the Internet. We found that 2 of the 21 applications 
transmitted to their server (1) the device’s phone num- 
ber, (2) the IMSI which is a unique 15-digit code used to 


identify an individual user on a GSM network, and (3) 
the ICC-ID number which is a unique SIM card serial 
number. We verified messages were flagged correctly by 
inspecting the plaintext payload. In neither case was the 
user informed that this information was transmitted off 
the phone. 


This finding demonstrates that Android’s coarse- 
grained access control provides insufficient protection 
against third-party applications seeking to collect sensi- 
tive data. Moreover, we found that one application trans- 
mits the phone information every time the phone boots. 
While this application displays a terms of use on first use, 
the terms of use does not specify collection of this highly 
sensitive data. Surprisingly, this application transmits the 
phone data immediately after install, before first use. 


Device Unique ID: The device’s IMEI was also exposed 
by applications. The IMEI uniquely identifies a specific 
mobile phone and is used to prevent a stolen handset 
from accessing the cellular network. TaintDroid flags 
indicated that nine applications transmitted the IMEI. 
Seven out of the nine applications either do not present 
an End User License Agreement (EULA) or do not spec- 
ify IMEI collection in the EULA. One of the seven ap- 
plications is a popular social networking application and 
another is a location-based search application. Furthermore,
we found two of the seven applications include the 
IMEI when transmitting the device’s geographic coordi- 
nates to their content server, potentially repurposing the 
IMEI as a client ID. 

In comparison, two of the nine applications treat the 
IMEI with more care, thus we do not classify them as 
potential privacy violators. One application displays a 
privacy statement that clearly indicates that the applica- 
tion collects the device ID. The other uses the hash of 
the IMEI instead of the number itself. We verified this 
practice by comparing results from two different phones. 


Location Data to Advertisement Servers: Half of the 
studied applications exposed location data to third-party 
advertisement servers without requiring implicit or ex- 
plicit user consent. Of the fifteen applications, only two 
presented a EULA on first run; however neither EULA 
indicated this practice. Exposure of location informa- 
tion occurred both in plaintext and in binary format. 
The latter highlights TaintDroid’s advantages over sim- 
ple pattern-based packet scanning. Applications sent lo- 
cation data in plaintext to admob.com, ad.qwapi.com, 
ads.mobclix.com (11 applications) and in binary format 
to FlurryAgent (4 applications). The plaintext location 
exposure to AdMob occurred in the HTTP GET string: 


...&s=a14a4a93f1e4c68&...&t=062A1CB1D476DE85B717D9195A6722A9&d%5Bcoord%5D=47.661227890000006%2C-122.31589477&... 


Investigating the AdMob SDK revealed the s= parameter 
is an identifier unique to an application publisher, and the 
coord= parameter provides the geographic coordinates. 
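
The coordinates in such a request are URL-encoded (%5B, %5D, and %2C stand for "[", "]", and ","). The following sketch decodes a single parameter with the standard java.net.URLDecoder class and parses the latitude and longitude. The class and method names are hypothetical, and the input is assumed to have exactly the form shown in the GET string above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper that extracts {latitude, longitude} from one
// URL-encoded parameter of the form d%5Bcoord%5D=<lat>%2C<lon>.
final class CoordParamSketch {
    static double[] parseCoord(String encodedParam) {
        String decoded;
        try {
            // Yields "d[coord]=<lat>,<lon>"
            decoded = URLDecoder.decode(encodedParam, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required of every Java platform.
            throw new IllegalStateException(e);
        }
        String value = decoded.substring(decoded.indexOf('=') + 1);
        String[] parts = value.split(",");
        return new double[] {
            Double.parseDouble(parts[0]),
            Double.parseDouble(parts[1])
        };
    }
}
```

Applied to the parameter d%5Bcoord%5D=47.661227890000006%2C-122.31589477, the helper returns a latitude of about 47.6612 and a longitude of about -122.3159. Plaintext exposure of this kind is readable from a packet trace, unlike the binary payloads discussed next.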

For FlurryAgent, we confirmed location exposure by 
the following sequence of events. First, a component 
named “FlurryAgent” registers with the location man- 
ager to receive location updates. Then, TaintDroid log 
messages show the application receiving a tainted par- 
cel from the location manager. Finally, the application 
reports “sending report to http://data.flurry.com/aar.do”
after receiving the tainted parcel. 

Our experimentation indicates these fifteen applica- 
tions collect location data and send it to advertisement 
servers. In some cases, location data was transmitted 
to advertisement servers even when no advertisement 
was displayed in the application. However, we note that 
TaintDroid helped us verify that three of the studied
applications (not included in Table 3) only transmitted
location data per the user’s request to pull localized content 
from their servers. This finding demonstrates the impor- 
tance of monitoring exercised functionality of an appli- 
cation that reflects how the application actually uses or 
abuses the granted permissions. 


Legitimate Flags: Out of 105 connections flagged by 
TaintDroid, 37 were deemed clearly legitimate use. The 
flags resulted from four applications and the OS itself 
while using the Google Maps for Mobile (GMM) API. 
The TaintDroid logs indicate an HTTP request with the 
“User-Agent: GMM ...” header, but a binary pay- 
load. Given that GMM functionality includes download- 
ing maps based on geographic coordinates, it is obvious 
that TaintDroid correctly identified location information 
in the payload. Our manual inspection of each message 
along with the network packet trace confirmed that there 
were no false positives. False negatives remain possible;
they are difficult to rule out without the source code of
the third-party applications. 


Summary: Our study of 30 popular applications shows 
the effectiveness of the TaintDroid system in accu- 
rately tracking applications’ use of privacy sensitive data. 
While monitoring these applications, TaintDroid gener- 
ated no false positives (with the exception of the IMSI 
taint source which we disabled for experiments, see Sec- 
tion 8). The flags raised by TaintDroid helped to identify 
potential privacy violations by the tested applications. 
Half of the studied applications share location data with 
advertisement servers. Approximately one third of the 
applications expose the device ID, sometimes with the 
phone number and the SIM card serial number. The anal- 
ysis was simplified by the taint tag provided by Taint- 
Droid that precisely describes which privacy relevant 
data is included in the payload, especially for binary pay- 
loads. We also note that there was almost no perceived 
latency while running experiments with TaintDroid. 


7 Performance Evaluation 


We now study TaintDroid’s taint tracking overhead. 
Experiments were performed on a Google Nexus One 
running Android OS version 2.1 modified for TaintDroid. 
Within the interpreted environment, TaintDroid incurs 
the same performance and memory overhead regardless 
of the existence of taint markings. Hence, we only need 
to ensure file access includes appropriate taint tags. 


7.1 Macrobenchmarks 


During the application study, we anecdotally observed 
limited performance overhead. We hypothesize that this 
is because: 1) most applications are primarily in a “wait 
state,” and 2) heavyweight operations (e.g., screen up- 
dates and webpage rendering) occur in unmonitored na- 
tive libraries. 

To gain further insight into perceived overhead, we 
devised five macrobenchmarks for common high-level 
smartphone operations. Each experiment was run 50 times,
and the observed 95% confidence intervals were at least
an order of magnitude smaller than the mean. In each case, 
we excluded the first run to remove unrelated initializa- 
tion costs. Experimental results are shown in Table 4. 


Table 4: Macrobenchmark Results 

                          Android    TaintDroid
App Load Time               63 ms         65 ms
Address Book (create)      348 ms        367 ms
Address Book (read)        101 ms        119 ms
Phone Call                  96 ms        106 ms
Take Picture              1718 ms       2216 ms


Application Load Time: The application load time 
measures from when Android’s Activity Manager receives 
a command to start an activity component to the 
time the activity thread is displayed. This time includes 
application resolution by the Activity Manager, IPC, and 
graphical display. TaintDroid adds only 3% overhead, as 
the operation is dominated by native graphics libraries. 


Address Book: We built a custom application to create, 
read, and delete entries for the phone’s address book, ex- 
ercising both file read and write. Create used three SQL 
transactions while read used two SQL transactions. The 
subsequent delete operation was lazy, returning in 0 ms, 
and hence was excluded from our results. TaintDroid 
adds approximately 5.5% and 18% overhead for address 
book entry creates and reads, respectively. The addi- 
tional overhead for reads can be attributed to file taint 
propagation. The data is not tainted before create, hence 
no file propagation is needed. Note that the user experi- 
ences less than 20 ms overhead when creating or viewing 
a contact. 


Phone Call: The phone call benchmark measured the 
time from pressing “dial” to the point at which the audio 
hardware was reconfigured to “in call” mode. TaintDroid 
adds only 10 ms per phone call setup (~10% overhead), 
which is significantly less than call setup in the network, 
which takes on the order of seconds. 


Take Picture: The picture benchmark measures from 
the time the user presses the “take picture” button un- 
til the preview display is re-enabled. This measurement 
includes the time to capture a picture from the camera 
and save the file to the SDcard. TaintDroid adds 498 ms 
to the 1718 ms needed by Android to take a picture (an 
overhead of 29%). A portion of this overhead can be
attributed to additional file operations required for taint 
propagation (one getxattr/setxattr pair per written data 
buffer). Note that some of this overhead can be reduced 
by eliminating redundant taint propagation. That is, only 
the taint tag for the first data buffer written to file needs to 
be propagated. For example, the current taint tag could 
be associated with the file descriptor. 
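The overhead figures quoted in this subsection follow directly from Table 4. As a quick arithmetic check, the Python sketch below recomputes the absolute and relative overhead for each macrobenchmark; the function name `overhead` is ours, and the numbers are the table's mean values.

```python
# Mean times in milliseconds, copied from Table 4: (Android, TaintDroid).
TABLE_4_MS = {
    "App Load Time": (63, 65),
    "Address Book (create)": (348, 367),
    "Address Book (read)": (101, 119),
    "Phone Call": (96, 106),
    "Take Picture": (1718, 2216),
}


def overhead(baseline_ms: float, taintdroid_ms: float) -> tuple[float, float]:
    """Return (absolute overhead in ms, relative overhead in percent)."""
    added = taintdroid_ms - baseline_ms
    return added, 100.0 * added / baseline_ms


for name, (android, taintdroid) in TABLE_4_MS.items():
    added_ms, percent = overhead(android, taintdroid)
    print(f"{name:<22} +{added_ms:>4.0f} ms  ({percent:.1f}%)")
```

The relative overhead is computed against the unmodified Android time, which is the convention the text uses (for example, 498 ms on top of 1718 ms is 29%).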


7.2 Java Microbenchmark 


Figure 5 shows the execution time results of a Java mi- 
crobenchmark. We used an Android port of the standard 
CaffeineMark 3.0 [43]. CaffeineMark uses an internal 
scoring metric only useful for relative comparisons. 

Figure 5: Microbenchmark of Java overhead. Error bars 
indicate 95% confidence intervals. 

The results are consistent with implementation-specific
expectations. The overhead incurred by TaintDroid is
smallest for the benchmarks dominated by arithmetic
and logic operations. The taint propagation for 
these operations is simple, consisting of an additional 
copy of spatially local memory. The string benchmark, 
on the other hand, experiences the greatest overhead. 
This is most likely due to the additional memory com- 
parisons that occur when the JNI propagation heuristic 
checks for string objects in method prototypes. 

The “overall” results indicate cumulative score across 
individual benchmarks. CaffeineMark documentation 
states that scores roughly correspond to the number of 
Java instructions executed per second. Here, the unmod- 
ified Android system had an average score of 1121, and 
TaintDroid measured 967. TaintDroid has a 14% over- 
head with respect to the unmodified system. 

We also measured memory consumption during the 
CaffeineMark benchmark. The benchmark consumed 
21.28 MB on the unmodified system and 22.21 MB while 
running on TaintDroid, indicating a 4.4% memory over- 
head. Note that much of an Android process’s memory 
is used by the zygote runtime environment. These na- 
tive library memory pages are shared between applica- 
tions to reduce the overall system memory footprint and 
require taint tracking. Given that TaintDroid stores 32 
taint markings (4 bytes) for each 32-bit variable in the 
interpreted environment (regardless of taint state), this 
overhead is expected. 
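The storage cost follows from the tag representation: one 32-bit vector per variable, with one bit per taint marking. A minimal Python sketch of that idea is below. The bit assignments and the names `Tagged`, `add`, and `has_marking` are hypothetical, and the rule that a binary operation yields the union of its operands' tags is the conventional one in dynamic taint tracking, assumed here rather than taken from this section.

```python
from dataclasses import dataclass

# Hypothetical bit positions; a 32-bit tag has room for 32 distinct markings.
TAINT_CLEAR = 0x0
TAINT_LOCATION = 0x1
TAINT_PHONE_NUMBER = 0x2
TAINT_DEVICE_ID = 0x4
WORD_MASK = 0xFFFFFFFF


@dataclass(frozen=True)
class Tagged:
    """A 32-bit value stored next to its 32-bit taint tag."""
    value: int
    tag: int = TAINT_CLEAR


def add(a: Tagged, b: Tagged) -> Tagged:
    """Binary operation: the result carries the union of both operand tags."""
    return Tagged((a.value + b.value) & WORD_MASK, (a.tag | b.tag) & WORD_MASK)


def has_marking(v: Tagged, marking: int) -> bool:
    """True if the given marking bit is set in the value's tag."""
    return bool(v.tag & marking)


latitude_e6 = Tagged(47661227, TAINT_LOCATION)
offset = Tagged(1000)  # untainted constant
shifted = add(latitude_e6, offset)
mixed = add(shifted, Tagged(42, TAINT_DEVICE_ID))
```

Because the tag is updated with a single bitwise OR next to the value, arithmetic-heavy code pays little for propagation, which matches the low overhead reported for the arithmetic and logic benchmarks.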


7.3 IPC Microbenchmark 


The IPC benchmark considers overhead due to the par- 
cel modifications. For this experiment, we developed 
client and service applications that perform binder trans- 
actions as fast as possible. The service manipulates ac- 
count objects (a username string and a balance integer) 
and provides two interfaces: setAccount() and getAccount().
The experiment measures the time for the client 
to invoke each interface pair 10,000 times. 

Table 5: IPC Throughput Test (10,000 msgs). 

                      Android     TaintDroid
Time (s)                 8.58          10.89
Memory (client)      21.06 MB       21.88 MB
Memory (service)     18.92 MB       19.48 MB

Table 5 summarizes the results of the IPC benchmark. 
TaintDroid was 27% slower than Android. TaintDroid 
only adds four bytes to each IPC object, therefore over- 
head due to data size is unlikely. The more likely cause of 
the overhead is the continual copying of taint tags as val- 
ues are marshalled into and out of the parcel byte buffer. 
Finally, TaintDroid used 3.5% more memory than An- 
droid, which is comparable to the consumption observed 
during the CaffeineMark benchmarks. 


8 Discussion 


Approach Limitations: TaintDroid only tracks data 
flows (i.e., explicit flows) and does not track control 
flows (i.e., implicit flows) to minimize performance over- 
head. Section 6 shows that TaintDroid can track applica- 
tions’ expected data exposure and also reveal suspicious 
actions. However, applications that are truly malicious 
can game our system and exfiltrate privacy sensitive in- 
formation through control flows. Fully tracking control 
flow requires static analysis [14, 37], which is not appli- 
cable to analyzing third-party applications whose source 
code is unavailable. Direct control flows can be tracked 
dynamically if a taint scope can be determined [53]; 
however, DEX does not maintain branch structures that 
TaintDroid can leverage. On-demand static analysis to 
determine method control flow graphs (CFGs) provides 
this context [39]; however, TaintDroid does not currently 
perform such analysis in order to avoid false positives 
and significant performance overhead. Our data flow 
taint propagation logic is consistent with existing, well 
known, taint tracking systems [7, 57]. Finally, once in- 
formation leaves the phone, it may return in a network 
reply. TaintDroid cannot track such information. 
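The difference between explicit and implicit flows can be made concrete with a toy model. The Python sketch below is illustrative only: `Tagged`, `copy_explicit`, and `copy_implicit` are hypothetical names, and the model tracks a single boolean taint rather than TaintDroid's tag vector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tagged:
    value: int
    tainted: bool = False


def copy_explicit(secret: Tagged) -> Tagged:
    """Data flow: the assignment carries the taint along with the value."""
    return Tagged(secret.value, secret.tainted)


def copy_implicit(secret: Tagged, bits: int = 8) -> Tagged:
    """Control flow: each bit is rebuilt from untainted constants.

    The branch condition depends on the secret, but no tainted value is
    ever assigned to `leaked`, so a data-flow-only tracker sees it as clean.
    """
    leaked = 0
    for i in range(bits):
        if (secret.value >> i) & 1:  # decision depends on tainted data
            leaked |= 1 << i         # but only constants are written
    return Tagged(leaked, tainted=False)


secret = Tagged(0b10110010, tainted=True)
explicit = copy_explicit(secret)
implicit = copy_implicit(secret)
```

Both copies hold the same value, yet only the explicit one is still marked. Catching the second case would require tainting everything written inside the scope of a tainted branch, which is the taint-scope problem discussed above.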


Implementation Limitations: Android uses the Apache 
Harmony [3] implementation of Java with a few custom 
modifications. This implementation includes support for 
the PlatformAddress class, which contains a native ad- 
dress and is used by DirectBuffer objects. The file and 
network IO APIs include write and read “direct” vari- 
ants that consume the native address from a DirectBuffer. 
TaintDroid does not currently track taint tags on Direct- 
Buffer objects, because the data is stored in opaque native 
data structures. Currently, TaintDroid logs when a read 
or write “direct” variant is used, which anecdotally
occurred with minimal frequency. Similar implementation 
limitations exist with the sun.misc.Unsafe class, which 
also operates on native addresses. 


Taint Source Limitations: While TaintDroid is very ef- 
fective for tracking sensitive information, it causes sig- 
nificant false positives when the tracked information con- 
tains configuration identifiers. For example, the IMSI nu- 
meric string consists of a Mobile Country Code (MCC), 
Mobile Network Code (MNC), and Mobile Station Identifier
Number (MSIN), which are all tainted together (see Note 5). 
Android uses the MCC and MNC extensively as con- 
figuration parameters when communicating other data. 
This causes all information in a parcel to become tainted, 
eventually resulting in an explosion of tainted informa- 
tion. Thus, for taint sources that contain configuration 
parameters, tainting individual variables within parcels 
is more appropriate. However, as our analysis results in 
Section 6 show, message-level taint tracking is effective 
for the majority of our taint sources. 
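Why a configuration identifier causes a taint explosion under message-level tracking can be seen in a small model. The Python sketch below is a toy, not Android's Parcel class: the `Parcel` type, its methods, the bit assignment, and the example values are all hypothetical.

```python
class Parcel:
    """Toy IPC message with ONE taint tag shared by everything inside it."""

    def __init__(self) -> None:
        self.fields: list[object] = []
        self.tag = 0  # message-level tag: union of all written tags

    def write(self, value: object, tag: int = 0) -> None:
        self.fields.append(value)
        self.tag |= tag

    def read(self, index: int) -> tuple[object, int]:
        # Per-field tags were never stored, so every field comes back
        # carrying the tag of the whole message.
        return self.fields[index], self.tag


TAINT_IMSI = 0x1  # hypothetical bit assignment

parcel = Parcel()
parcel.write("310", TAINT_IMSI)  # example MCC value derived from the IMSI
parcel.write("en_US")            # unrelated, untainted setting
locale, locale_tag = parcel.read(1)
```

The untainted locale string leaves the parcel marked as IMSI-derived. When the tainted field is a configuration parameter that accompanies many messages, this over-approximation spreads quickly, which is the motivation for per-variable tags within parcels.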


9 Related Work 


Mobile phone host security is a growing concern. 
OS-level protections such as Kirin [18], Saint [42], 
and Security-by-Contract [15] provide enhanced security 
mechanisms for Android and Windows Mobile. These 
approaches prevent access to sensitive information; how- 
ever, once information enters the application, no addi- 
tional mediation occurs. In systems with larger displays, 
a graphical widget [27] can help users visualize sensor 
access policies. Mulliner et al. [36] provide information 
tracking by labeling smartphone processes based on the 
interfaces they access, effectively limiting access to fu- 
ture interfaces based on acquired labels. 

Decentralized information flow control (DIFC) en- 
hanced operating systems such as Asbestos [52] and HiS- 
tar [60] label processes and enforce access control based 
on Denning’s lattice model for information flow secu- 
rity [13]. Flume [30] provides similar enhancements for 
legacy OS abstractions. DEFCon [34] uses a logic simi- 
lar to these DIFC OSes, but focuses on events and modi- 
fies a Java runtime with lightweight isolation. Related to 
these system-level approaches, PRECIP [54] labels both 
processes and shared kernel objects such as the clipboard 
and display buffer. However, these process-level infor- 
mation flow models are coarse grained and cannot track 
sensitive information within untrusted applications. 

Tools that analyze applications for privacy sensi- 
tive information leaks include Privacy Oracle [28] and 
TightLip [59]. These tools investigate applications while 
treating them as a black box, thus enabling analysis of 
off-the-shelf applications. However, such black-box
analysis becomes ineffective when applications use
encryption prior to releasing sensitive information. 

Language-based information flow security [46] extends
existing programming languages by labeling variables
with security attributes. Compilers use the security
labels to generate security proofs, e.g., Jif [37, 38] 
and SLam [24]. Laminar [45] provides DIFC guarantees 
based on programmer defined security regions. However, 
these languages require careful development and are of- 
ten incompatible with legacy software designs [25]. 


Dynamic taint analysis provides information track- 
ing for legacy programs. The approach has been used 
to enhance system integrity (e.g., defend against soft- 
ware attacks [41, 44, 8]) and confidentiality (e.g., dis- 
cover privacy exposure [57, 16, 61]), as well as track 
Internet worms [9]. Dynamic tracking approaches 
range from whole-system analysis using hardware exten- 
sions [51, 11, 50] and emulation environments [7, 57] 
to per-process tracking using dynamic binary transla- 
tion (DBT) [6, 44, 8, 61]. The performance and mem- 
ory overhead associated with dynamic tracking has re- 
sulted in an array of optimizations, including optimizing 
context switches [44], on-demand tracking [26] based 
on hypervisor introspection, and function summaries for 
code with known information flow properties [61]. If 
source code is available, significant performance im- 
provements can be achieved by automatically instru- 
menting legacy programs with dynamic tracking func- 
tionality [56, 31]. Automatic instrumentation has also 
been performed on x86 binaries [47], providing a com- 
promise between source code translation and DBT. Our 
TaintDroid design was inspired by these prior works, but 
addressed different challenges unique to mobile phones. 
To our knowledge, TaintDroid is the first taint tracking 
system for a mobile phone and is the first dynamic taint 
analysis system to achieve practical system-wide analy- 
sis through the integration of tracking multiple data ob- 
ject granularities. 


Finally, dynamic taint analysis has been applied to vir- 
tual machines and interpreters. Haldar et al. [22] in- 
strument the Java String class with taint tracking to pre- 
vent SQL injection attacks. WASP [23] has similar mo- 
tivations; however, it uses positive tainting of individ- 
ual characters to ensure the SQL query contains only 
high-integrity substrings. Chandra and Franz [5] pro- 
pose fine-grained information flow tracking within the 
JVM and instrument Java byte-code to aid control flow 
analysis. Similarly, Nair et al. [39] instrument the Kaffe 
JVM. Vogt et al. [53] instrument a Javascript interpreter 
to prevent cross-site scripting attacks. Xu et al. [56] au- 
tomatically instrument the PHP interpreter source code 
with dynamic information tracking to prevent SQL in- 
jection attacks. Finally, the Resin [58] environment for 
PHP and Python uses data flow tracking to prevent an as- 
sortment of Web application attacks. When data leaves 
the interpreted environment, Resin implements filters 
for files and SQL databases to serialize and de-serialize 
objects and policy with byte-level granularity.
TaintDroid’s interpreted code taint propagation bears
similarity to some of these works. However, TaintDroid
implements system-wide information flow tracking,
seamlessly connecting interpreter taint tracking with a range 
of operating system sharing mechanisms. 


10 Conclusions 


While some mobile phone operating systems allow 
users to control applications’ access to sensitive informa- 
tion, such as location sensors, camera images, and con- 
tact lists, users lack visibility into how applications use 
their private data. To address this, we present TaintDroid, 
an efficient, system-wide information flow tracking tool 
that can simultaneously track multiple sources of sensi- 
tive data. A key design goal of TaintDroid is efficiency, 
and TaintDroid achieves this by integrating four gran- 
ularities of taint propagation (variable-level, message- 
level, method-level, and file-level) to achieve a 14% per- 
formance overhead on a CPU-bound microbenchmark. 

We also used our TaintDroid implementation to study 
the behavior of 30 popular third-party applications, cho- 
sen at random from the Android Marketplace. Our study 
revealed that two-thirds of the applications in our study 
exhibit suspicious handling of sensitive data, and that 15 
of the 30 applications reported users’ locations to remote 
advertising servers. Our findings demonstrate the effec- 
tiveness and value of enhancing smartphone platforms 
with monitoring tools such as TaintDroid. 


Notes 


1 A similar approach can be applied to just-in-time compilation by 
inserting tracking code within the generated binary. 

2 Only 11 internal VM methods were added between versions 1.5 
and 2.1 (primarily for debugging and profiling). 

3 There was a relatively small number of JNI methods that did not 
either have a Java interface or C/C++ implementation. These unusable 
methods were excluded from our survey. 

4 Because of the limitation of the IMSI taint source as discussed in 
Section 8, we disabled the IMSI taint source for experiments.
Nonetheless, TaintDroid’s flag of the ICC-ID and the phone number led us to 
find the IMSI contained in the same payload. 

5 Regardless of the string separation, the MCC and MNC are
identifiers that warrant taint sources. 


Abstract 


StarTrack was the first service designed to manage tracks 
of GPS location coordinates obtained from mobile de- 
vices and to facilitate the construction of track-based 
applications. Our early attempts to build practical ap- 
plications on StarTrack revealed substantial efficiency 
and scalability problems, including frequent client-server 
roundtrips, unnecessary data transfers, costly similar- 
ity comparisons involving thousands of tracks, and poor 
fault-tolerance. To remedy these limitations, we revised 
the overall system architecture, API, and implementa- 
tion. The API was extended to operate on collections 
of tracks rather than individual tracks, delay query exe- 
cution, and permit caching of query results. New data 
structures, namely track trees, were introduced to speed 
the common operation of searching for similar tracks. 
Map matching algorithms were adopted to convert each 
track into a more compact and canonical sequence of 
road segments. And the underlying track database was 
partitioned and replicated among multiple servers. Al- 
together, these changes not only simplified the construc- 
tion of track-based applications, which we confirmed by 
building applications using our new API, but also re- 
sulted in considerable performance gains. Measurements 
of similarity queries, for example, show two to three or- 
ders of magnitude improvement in query times. 


1 Introduction 


The easy availability of function-rich mobile devices has 
fueled significant interest in the “mobile internet”, where 
mobile devices access internet-based services and web 
applications. Mobile devices that can determine their 
own physical location are adding to this trend by facilitat- 
ing the development of diverse location-based services. 
In addition to individual coordinates, “tracks” (time-ordered
sequences of GPS locations recorded by mobile devices)
enable many location-oriented applications, ranging from
personal applications such as trip planning and health
monitoring to social applications such as ride-sharing
and urban sensing. 


StarTrack, introduced in an earlier paper, was the first 
service designed to manage tracks from mobile devices 
and to facilitate the construction of track-based applica- 
tions [3]. That paper was primarily focussed on identi- 
fying a rich class of interesting personal and social ap- 
plications that exploited histories of tracks; not much 
attention was paid to implementing the service at scale 
or building applications. Indeed, the entire implementa- 
tion relied heavily on the services of a single database 
server with a thin software veneer providing an API. 
No applications were built using this API. Our first at- 
tempt to build realistic applications using this system re- 
vealed many shortcomings: principally inadequate per- 
formance, scalability, and fault-tolerance. Some of these, 
e.g. fault-tolerance, arose out of inadequate system struc- 
ture in the original implementation. But by far most of 
the shortcomings arose out of a mismatch between the 
API provided by the system and what was required by 
applications. Specifically, several functions that were 
necessary for applications were either missing in the API 
or needed to be synthesized from lower-level primitives 
of the API. This mismatch led to costly and unneces- 
sary client-server communication and data transfer. In 
addition to these deficiencies, our original system imple- 
mented common operations inefficiently (e.g. track com- 
parisons). 

This paper describes how the design and implementa- 
tion of StarTrack have evolved non-trivially to address 
real-world issues of dealing with tracks. Our experi- 
ence with track-based applications is admittedly limited. 
We do not claim our API is universal or fundamental in 
any sense; it will undoubtedly evolve as we encounter 
new classes of applications that we have not anticipated. 
Nonetheless, we believe our work and experience to date 
will be beneficial to researchers and practitioners in this 
rapidly growing field. 

In general, we found managing and providing semantically
rich operations on tracks to be surprisingly difficult.
Track queries are complex because they involve 
geographic and similarity constraints, and a naive solu- 
tion requiring expensive evaluation of these constraints 
does not scale to real-world online demand. 

The main insight we use in tackling the complexity 
of tracks is to recognize that tracks tend to be repetitive. 
Repetitiveness arises from two distinct sources. First, an
individual tends to follow substantially similar routes in
day-to-day life. This intuition is supported scientifically 
by a recent study in Science [23]. Second, the vast major- 
ity of tracks are collected on roads and highways, again 
leading to significant overlap in tracks even if they are 
from different users. 

This insight permeates all parts of our revamped Star- 
Track infrastructure. We made several changes to our 
system. In some cases, we needed new techniques and 
data structures; in other cases, we used more established 
techniques, but synthesized in novel ways, to support a 
new class of track-based applications efficiently. 

The changes to our system fall into four broad areas: 


API Changes. All operations in our original API dealt 
with individual tracks, often causing entire sets of tracks 
to be moved repeatedly between the service and applica- 
tions. StarTrack currently supports a “track collection”, 
representing a set of tracks. Several functions in the API 
now operate on and return results as track collections. 
This change had several benefits. Apart from the obvi- 
ous ease of programming, it afforded StarTrack opportu- 
nities to optimize the performance of specific operations 
through delayed and partial evaluation of these collec- 
tions. Caching of both full and partial results also be- 
came possible. 


Changes in Track Representation. We quickly discov- 
ered dealing with “raw” tracks by themselves to be in- 
efficient. We now use a “canonical” representation for 
tracks, where tracks are represented as a sequence of 
points drawn from a fixed set, such as road intersections. 
Canonicalization benefits many aspects of the system. 
It reduces the computational costs of track comparison 
while improving its accuracy. As a consequence of im- 
proved accuracy, we are able to group a user’s similar 
tracks more effectively and maintain a small set of repre- 
sentative tracks that captures the essentials of a large set 
of tracks. Many applications only need to operate on the 
set of representative tracks, leading to significantly fewer 
operations, better caching of data, and consequently, bet- 
ter performance. 
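The effect of canonicalization can be sketched in a few lines of Python. This is a deliberately simplified stand-in: it snaps each raw point to the nearest member of a fixed set of intersections and drops consecutive repeats, whereas real map matching also has to respect road topology. The function name, the coordinates, and the intersection ids are all hypothetical.

```python
import math

Point = tuple[float, float]


def canonicalize(raw_track: list[Point], intersections: dict[str, Point]) -> list[str]:
    """Map each raw point to its nearest intersection id, dropping repeats."""
    canonical: list[str] = []
    for point in raw_track:
        nearest = min(
            intersections,
            key=lambda name: math.dist(point, intersections[name]),
        )
        if not canonical or canonical[-1] != nearest:
            canonical.append(nearest)
    return canonical


# Hypothetical planar coordinates (not real GPS data).
INTERSECTIONS = {"I1": (0.0, 0.0), "I2": (10.0, 0.0), "I3": (10.0, 10.0)}
monday = [(0.2, 0.1), (4.0, 0.3), (9.8, -0.2), (10.1, 9.7)]
tuesday = [(-0.1, 0.3), (9.9, 0.4), (10.3, 5.2), (9.9, 10.2)]

canonical_monday = canonicalize(monday, INTERSECTIONS)
canonical_tuesday = canonicalize(tuesday, INTERSECTIONS)
```

Two noisy recordings of the same commute collapse to the same short sequence, so comparing them becomes an equality test on a handful of identifiers instead of a geometric comparison of raw points.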


Changes to On-Disk and In-Memory Data Struc- 
tures. The original StarTrack API was implemented as 
a thin veneer on top of a geospatial database system. 
While simplifying the implementation, this resulted in 
poor performance for many operators. The changes in 
the API and canonicalization described above allowed 
us to build specialized in-memory data structures to aug- 
ment the database tables. Operations that had low per- 
formance are now optimized by using in-memory quad- 
trees or a novel structure called a track tree described in 
Section 3.3. In addition to these in-memory data struc- 
tures, we reorganized the database layout to include a 
table of representative tracks for each user (as mentioned 
above) and other tables that aid in handling operations 
with geographical constraints. 


Structural Changes. Our original prototype consisted 
of a single server process that stored tracks in a central- 
ized database and implemented an API to access these 
tracks. This single server implementation clearly did 
not scale to a large number of tracks or provide fault- 
tolerance. In the new system, a set of StarTrack server 
machines connects to another set of database servers. 
Applications use a StarTrack clerk, which implements 
the API and makes remote procedure calls (RPCs) to the 
StarTrack servers as necessary. It also deals with retrying 
requests on server failures, and balances RPC requests 
amongst servers. 
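The clerk's two duties, spreading requests across servers and retrying on failure, can be sketched as follows. This Python model is an assumption-laden illustration: the `Clerk` class, the round-robin policy, and the use of `ConnectionError` to signal a failed server are our choices, not details given in the paper.

```python
import itertools
from typing import Callable, Optional, Sequence

Server = Callable[[str], str]  # hypothetical RPC stub: request -> reply


class Clerk:
    """Client-side library: spreads requests over servers and retries."""

    def __init__(self, servers: Sequence[Server]) -> None:
        self._servers = list(servers)
        self._next = itertools.cycle(range(len(self._servers)))

    def call(self, request: str) -> str:
        last_error: Optional[Exception] = None
        # Try each server at most once, starting from the round-robin cursor.
        for _ in range(len(self._servers)):
            server = self._servers[next(self._next)]
            try:
                return server(request)
            except ConnectionError as error:  # treat as a failed server
                last_error = error
        raise ConnectionError("all StarTrack servers failed") from last_error


def healthy(request: str) -> str:
    return f"ok:{request}"


def crashed(request: str) -> str:
    raise ConnectionError("server down")


clerk = Clerk([crashed, healthy])
reply = clerk.call("GetSimilarTracks")
```

Keeping this logic in the clerk means applications see a single API call, while server failures are masked as long as at least one server remains reachable.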

We detail our changes further in the rest of the paper 
(Sections 2 to 4), describe two scalable, robust, and
efficient applications they enabled us to build (Section 5),
and summarize their performance impact (Section 6). 


2 Application Programming Interface 


The interface exported by the StarTrack service has 
undergone multiple revisions based on our experience 
building realistic applications. This section describes the 
key elements of the new application programming inter- 
face; space restrictions prevent us from describing the 
complete API. 


2.1 Track Collections 


The new StarTrack interface supports the notion of a 
track collection, an abstract grouping of tracks, where 
the application supplies the criteria for grouping. Track 
collections can, in turn, participate in other StarTrack op- 
erations. All non-trivial operations in the StarTrack API 
take a track collection as an argument. 

Track collections have two significant advantages: 


Implementation Efficiency. They allow the server to 
treat the set of tracks that are repeatedly accessed to- 
gether as a single entity for the purposes of caching. 
They also allow the server to construct specialized data 
structures that operate exclusively on these tracks,
making these operations more efficient. Furthermore, by
having applications and the service refer to a potentially 
large collection of track identifiers by a single identifier, 
we reduce the communication costs of transmitting the 
identities of individual tracks between them. 


Programming Convenience. Applications often want 
to constrain operations to tracks that belong to a particu- 
lar community or cohort. For example, a social applica- 
tion might wish to operate on the tracks of a user and his 
group of friends. Track collections allow such an appli- 
cation to create an aggregation of the tracks in which it is 
interested and enable it to operate on such groups more 
conveniently. 


Track collections are created by using the MakeCol- 
lection procedure (see API Fragment 2.1). MakeCollec- 
tion takes as its first argument a set of criteria to select 
a group of tracks from all tracks in the system. Individual
criteria can be composed out of three elements: geographic,
time, and user. The first two elements have fairly 
simple semantics: a geographic element is specified by a 
physical geographical region and a time element is spec- 
ified by a time interval. The user element consists of two 
subfields: a unique identifier that specifies the user and a 
string field that specifies an XPATH query. The query is 
applied to the user metadata that is stored in the track by 
the application. 





TrackCollxn MakeCollection(GrpCriteria[] gCrit, 
                           bool unique); 








API fragment 2.1: Operation to create a track collec- 
tion. 


The second argument is a boolean that indicates 
whether the system should return only “unique” tracks. 
Two canonical tracks are treated as duplicates, so that only
one of them is returned, if their starting points (as well
as their ending points) are “close” to each other and
their paths are highly “similar” to each other. 
Similarity is more precisely defined below when we dis- 
cuss the GetSimilarTracks function. Parameters that de- 
cide if the start/end points are “close” to one another and 
if tracks are highly similar are defined by the infrastruc- 
ture. These are described further in Section 4.1. 

We provide applications the option to specify the 
unique flag for two reasons. People tend to travel the 
same routes habitually, leading to multiple highly similar 
tracks that only differ in time. Meanwhile, many applica- 
tions are only interested in distinct routes without requir- 
ing knowledge of the precise times at which the route was 
traveled. These applications greatly benefit from using 
MakeCollection with the unique flag set to true since it 
significantly reduces the number of tracks in the returned 
collection. If instead an application needs per track infor- 
mation, for instance, if it needs to know how fast the user 
travels on a particular road segment, setting unique to 
false will retrieve all the relevant tracks with detailed
information. 
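The effect of the unique flag can be sketched in Python. This is an illustrative in-memory model, not the StarTrack implementation: the `Track` type and `make_collection` signature are hypothetical, only user and time criteria are modeled, and exact equality of canonical segment sequences stands in for the infrastructure-defined “close” and “similar” tests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    user: str
    start_hour: int             # hour of day at which the track began
    segments: tuple[str, ...]   # canonical road-segment ids


def make_collection(tracks, user=None, start_hour=None, end_hour=None, unique=False):
    """Select tracks matching all given criteria; optionally drop repeats."""
    selected = []
    seen_routes = set()
    for track in tracks:
        if user is not None and track.user != user:
            continue
        if start_hour is not None and track.start_hour < start_hour:
            continue
        if end_hour is not None and track.start_hour >= end_hour:
            continue
        if unique:
            if track.segments in seen_routes:
                continue  # same route travelled at another time
            seen_routes.add(track.segments)
        selected.append(track)
    return selected


ALL_TRACKS = [
    Track("Uriah", 8, ("s1", "s2", "s3")),
    Track("Uriah", 9, ("s1", "s2", "s3")),  # same commute, another day
    Track("Uriah", 9, ("s1", "s4")),
    Track("Uriah", 18, ("s3", "s2", "s1")),
    Track("Agnes", 8, ("s5",)),
]

every_morning_track = make_collection(ALL_TRACKS, "Uriah", 8, 10, unique=False)
distinct_morning_routes = make_collection(ALL_TRACKS, "Uriah", 8, 10, unique=True)
```

With unique set to false the caller receives all three morning tracks and can inspect per-track details; with unique set to true the repeated commute is returned once, which is the size reduction the text describes.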

Two simple code segments calling MakeCollection are 
shown in Examples 2.1 and 2.2. The first example col- 
lects the tracks of user Uriah between 8AM and 10AM. 
The second shows how metadata information is used to 
create a track collection of all employees of an organiza- 
tion. 


Example 2.1 Uriah’s tracks between 8AM and 10AM. 





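// Two criteria: one selects the user, the other the time window.
GrpCriteria[] gCrit = new GrpCriteria[2]; 
UserCriteria uc = new UserCriteria(); 
uc.Username = "Uriah"; 
TimeCriteria tc = new TimeCriteria(); 
tc.StartHour = 8; tc.EndHour = 10; 
gCrit[0] = uc; gCrit[1] = tc; 

// unique = false: keep every matching track, including repeated routes.
TrackCollxn tcUriah; 
tcUriah = MakeCollection(gCrit, false); 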





Example 2.2 Tracks of all employees of the Wickfield 
corporation. The metadata string is an XPATH query, 
shown here in simplified syntax for formatting reasons. 





GrpCriteria[] gCrit = new GrpCriteria[1]; 
UserCriteria uc = new UserCriteria(); 
uc.metadata = "Employer = Wickfield"; 
gCrit[0] = uc; 

TrackCollxn tcWField; 
tcWField = MakeCollection(gCrit, true); 





2.2 Manipulating Tracks 


Tracks can be manipulated in several ways; we describe 
a few representative operations. We have chosen these 
because they embody the most significant changes we 
made to the original prototype. Other operations are es- 
sentially unchanged from our previous API. 

JoinTrkCollections takes two or more track collections
and creates a new track collection that is the union of all
the constituent tracks. The second argument allows the
resulting track collection to retain only unique tracks.

SortTracks takes a track collection and orders the
constituent tracks in the collection according to one of a set
of predefined attributes. Examples of attributes we have
implemented are LENGTH and FREQ, which refer to the
length of the track and its frequency of occurrence within
that track collection.

Many track-based applications need to determine
whether tracks are similar to one another. Given two
tracks, we define track similarity as the total length of
the segments common to both tracks divided by the total
length of the union of all segments present in either of
them (Figure 1(a)). GetSimilarTracks is given a track
collection and a reference track and selects from within
the collection all tracks that are similar to the reference
track. The returned track collection is sorted by
similarity. The degree of similarity is controlled by the
third parameter.


Figure 1: (a) the similarity between tracks A and B is 1
and between A and D is $(l_1 + l_2 + l_3)/(l_1 + l_2 + l_3 +
l_4 + l_5 + l_8 + l_9)$, where $l_i$ is the length of segment
$s_i$; (b) A, B, C are the tracks that pass by the areas $R_1$
and $R_2$; (c) S is the common segment of A, B, C, D with
frequency threshold set to 0.6.
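
The similarity definition can be written down in a few lines. The Python sketch below is illustrative only; it assumes tracks are lists of canonical segment identifiers and that segment lengths are available in a dictionary (both representation choices are ours, not the paper's). Track D follows the description in Section 3.3 (segments S1 to S3, then S8 and S9).

```python
def similarity(track_a, track_b, length):
    """Length of shared segments divided by length of the union of segments."""
    a, b = set(track_a), set(track_b)
    union = a | b
    if not union:
        return 0.0
    common_len = sum(length[s] for s in a & b)
    union_len = sum(length[s] for s in union)
    return common_len / union_len


def get_similar_tracks(collection, ref, sim_thresh, length):
    """Tracks at least sim_thresh similar to ref, most similar first."""
    scored = [(similarity(t, ref, length), t) for t in collection]
    kept = [(s, t) for s, t in scored if s >= sim_thresh]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in kept]


length = {f"s{i}": 1.0 for i in range(1, 10)}   # assumed: unit-length segments
A = ["s1", "s2", "s3", "s4", "s5"]
B = list(A)
D = ["s1", "s2", "s3", "s8", "s9"]
```

With unit-length segments, A and B have similarity 1, while A and D share three of seven segments.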

Track-based applications can find tracks that pass 
within close proximity of a location by calling GetPass- 
ByTracks. GetPassByTracks is given a track collection 
and an array of Area objects and returns all tracks in the 
collection that pass through all the areas (Figure 1(b)). 
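
As a rough illustration of the pass-by semantics, the following Python sketch keeps a track only if it has a point inside every requested area. The rectangular Area class and the point-in-rectangle test are simplifying assumptions; the real Area objects and any segment-level intersection tests are not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Area:
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float

    def contains(self, point):
        lat, lon = point
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


def get_pass_by_tracks(collection, areas):
    """Keep tracks that have at least one point inside every area."""
    return [
        track for track in collection
        if all(any(area.contains(p) for p in track) for area in areas)
    ]


r1 = Area(0.0, 0.0, 1.0, 1.0)
r2 = Area(4.0, 4.0, 5.0, 5.0)
track_a = [(0.5, 0.5), (2.0, 2.0), (4.5, 4.5)]   # passes through r1 and r2
track_d = [(0.5, 0.5), (2.0, 3.0), (2.0, 6.0)]   # passes through r1 only
result = get_pass_by_tracks([track_a, track_d], [r1, r2])
```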

GetCommonSegments takes a track collection and a 
frequency threshold and returns the road segments shared 
by at least that fraction of the tracks in the collection. 
These road segments are merged into the smallest num- 
ber of contiguous routes possible (see Figure 1(c)). This 
operation is useful for the application to retrieve a suc- 
cinct summary of a potentially large set of tracks. 
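
The frequency test at the heart of this operation can be sketched as follows in Python. The sketch covers only the selection of shared segments; merging them into the smallest number of contiguous routes is omitted, and the list-of-segment-ids track representation is an assumption.

```python
from collections import Counter


def common_segments(collection, freq_thresh):
    """Segments that appear in at least freq_thresh of the tracks."""
    if not collection:
        return set()
    # Count each segment once per track, even if a track revisits it.
    counts = Counter(seg for track in collection for seg in set(track))
    needed = freq_thresh * len(collection)
    return {seg for seg, n in counts.items() if n >= needed}


tracks = [
    ["s1", "s2", "s3", "s4"],
    ["s1", "s2", "s3", "s5"],
    ["s2", "s3", "s6"],
    ["s7", "s3", "s8"],
]
shared = common_segments(tracks, 0.6)
```

Here s2 appears in three of four tracks and s3 in all four, so both clear a threshold of 0.6, while s1 (two of four) does not.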

Tracks within a TrackCollxn object can be retrieved via
the following two functions (see API fragment 2.3).
GetTrackCount returns the number of tracks in a track
collection, and GetTracks returns count tracks beginning
at the start location within a track collection.


TrackCollxn JoinTrkCollections(TrkCollxn tCs[],
    bool unique);

TrackCollxn SortTracks(TrkCollxn tC,
    SortAttribute attr);

TrackCollxn GetSimilarTracks(TrkCollxn tC,
    Trk refTrk, float simThresh);

TrackCollxn GetPassByTracks(TrkCollxn tC,
    Area[] areas);

TrackCollxn GetCommonSegments(TrkCollxn tC,
    float freqThresh);

API fragment 2.2: Operations to manipulate a track collection.


int GetTrackCount(TrkCollxn tC);
Track[] GetTracks(TrkCollxn tC, int start,
    int count);

API fragment 2.3: Retrieval operations on a track collection.


3 StarTrack Server Design


This section describes three changes to the StarTrack
server design that we consider most significant.


3.1 Canonicalization of Tracks


In our first implementation, we stored users’ latitude and
longitude coordinates directly in the system. While this
design choice was intuitive and useful in some circumstances,
it was problematic in many others. Recall that
coordinates are samples of a path taken by a user. The
same path taken by different users may be sampled at
different points. Also, sampling is inherently error-prone
due to limitations in current localization techniques [8].
For these reasons, two identical paths can lead to widely
different sampled coordinates, making it difficult to
classify them as equal. In the new system, we “canonicalize”
paths to eliminate spurious variability in the sampled
coordinates. In this context, canonicalization means
that we convert a path to another path that only passes
through a set of “standard” points drawn from a (large)
fixed set. We refer to the portion of the path between two
such points as a segment.

There are several methods to canonicalize tracks. One 
intuitive way is to overlay a fixed grid on the geographic 
region and to map each coordinate to a grid intersection 
point. A variation on this technique is to pick a suitably 
weighted interior point within the grid instead of a cor- 
ner. 
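
The grid-based variant is simple enough to sketch directly. The Python fragment below is a minimal illustration, not the method StarTrack uses (StarTrack uses map matching, described next); the cell size and the use of integer grid indices as canonical points are assumptions.

```python
def snap_to_grid(point, cell_deg):
    """Map a (lat, lon) sample to the index of the nearest grid intersection."""
    lat, lon = point
    return (round(lat / cell_deg), round(lon / cell_deg))


def canonicalize(samples, cell_deg):
    """Snap every sample, then drop consecutive duplicates."""
    path = []
    for p in samples:
        q = snap_to_grid(p, cell_deg)
        if not path or path[-1] != q:
            path.append(q)
    return path


# Two noisy samplings of the same underlying path.
trip1 = [(47.6201, -122.3499), (47.6299, -122.3402), (47.6401, -122.3301)]
trip2 = [(47.6198, -122.3503), (47.6304, -122.3397), (47.6396, -122.3298)]
canon1 = canonicalize(trip1, 0.01)
canon2 = canonicalize(trip2, 0.01)
```

Both trips canonicalize to the same sequence of grid points, so they can be compared for equality without any geographic tolerance.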

A fundamental shortcoming of approaches based on a 
fixed grid is that the grid is artificially created and does 
not adapt to users’ tracks. Grids may be too fine-grained, 
in which case canonicalization provides no benefits, or 
too coarse-grained, in which case important features of 
tracks are lost. 

Instead of using an artificial grid, we can often use the 
more natural and adaptive grid imposed by streets and 
highways. Canonicalizing based on street maps is called 
map matching and is desirable in cases where roadmaps 
of the region exist. A track after canonicalization is 
mapped to a path in the roadmap. A path consists of 
one or more street segments and is stored as a sequence
of the endpoints of the segment(s). StarTrack uses a
map matching approach using hidden Markov models 
designed by Krumm et al. [17, 20]. 

The performance of canonicalization is dependent on
three factors: the sampling rate of a track (i.e., the number
of GPS points in the track), the length of the track,
and the amount of GPS noise introduced into the
samples. In our system, canonicalization is done offline as a
pre-processing step. Since its performance is therefore not
critical to our system, we do not present detailed
results. With some performance tuning, StarTrack
can canonicalize a track with average trip length
of about 20 km and 400 GPS samples in under 250 ms.

Canonicalization has two key advantages that translate 
into performance savings. First, StarTrack can compare 
two segments for equality without using expensive geo- 
graphic constraints. Equality of segments is used within 
the inner loop of the procedure that finds similar tracks, 
which in turn is a very common operation in applications. 
Second, canonicalization tends to create larger numbers 
of identical segments. This often allows us to access and 
manipulate a single representative segment rather than 
dealing with individual segments. It also allows Star- 
Track to identify duplicate tracks more accurately and 
reduces the number of tracks it needs to process for var- 
ious operations. 

Canonicalization based on road networks is appropri- 
ate for regions that have a mature road network and a sta- 
ble map. When road networks are not available, we may 
utilize technologies for constructing road maps from user 
tracks [5, 7]. 


3.2 Delayed Evaluation 


We found that applications typically make several API 
calls to narrow down the set of tracks they want to re- 
trieve. Our implementation of the API therefore delays 
the evaluation of the tracks in a track collection until 
one of the two retrieval functions in API Fragment 2.3 
is called. This technique saves multiple roundtrips be- 
tween the StarTrack clerk and servers. Furthermore, it 
allows the StarTrack server flexibility in the queries it is- 
sues to the database and in the choice of data structures 
it builds for different retrieval operations. 

When a client invokes a MakeCollection operation, the
client-side stub marshals an efficient description of the
call arguments and a small integer representing the
procedure name. We call the resulting structure a descriptor.
The stub sends the descriptor to the server, which stamps
it with the current time to capture the database contents at
that instant and returns it.¹ We require that the timestamp
be in the past with respect to the time on the database
server. Assuming that tracks are not deleted from the
system, this guarantees that multiple evaluations of a track
collection will always return the same set of tracks.

¹ There are well-known ways to avoid this RPC call, but we have
chosen not to implement them for simplicity.

Operations such as JoinTrkCollections, GetPopularTracks,
GetSimilarTracks, and GetPassByTracks create
compositions of these descriptors (at the client stub) with
no communication to the server and no additional
timestamps. We refer to these compositions as compound
descriptors. These are organized as a tree whose leaves
are simple timestamped descriptors.
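
The interplay of timestamped leaf descriptors, client-side composition, and deferred evaluation can be sketched in Python. Everything below is a toy model under stated assumptions: the Descriptor and Server classes, the round-trip counter, the integer clock, and the user name Agnes are hypothetical and only mirror the behavior described in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Descriptor:
    op: str                          # name of the invoked operation
    args: tuple                      # call arguments
    timestamp: Optional[int] = None  # set only on leaves, by the server
    children: tuple = ()             # sub-descriptors of a compound descriptor


class Server:
    """Toy server: counts round trips and evaluates descriptor trees."""

    def __init__(self, tracks):
        self.tracks = tracks         # list of (insert_time, user, route)
        self.clock = 100
        self.round_trips = 0

    def stamp(self, op, args):
        self.round_trips += 1
        return Descriptor(op, args, timestamp=self.clock)

    def evaluate(self, d):
        if d.op == "MakeCollection":
            (user,) = d.args
            # Only tracks stored at or before the stamp are visible.
            return [r for t, u, r in self.tracks if u == user and t <= d.timestamp]
        if d.op == "JoinTrkCollections":
            return [r for child in d.children for r in self.evaluate(child)]
        raise ValueError(d.op)

    def get_tracks(self, d):
        self.round_trips += 1
        return self.evaluate(d)


def join(*descriptors):
    # Built entirely at the client stub: no server call, no new timestamp.
    return Descriptor("JoinTrkCollections", (), children=tuple(descriptors))


server = Server([(10, "Uriah", "r1"), (20, "Agnes", "r2")])
a = server.stamp("MakeCollection", ("Uriah",))
b = server.stamp("MakeCollection", ("Agnes",))
joined = join(a, b)
trips_before_retrieval = server.round_trips
result = server.get_tracks(joined)

server.tracks.append((150, "Uriah", "r3"))   # stored after the stamp
again = server.get_tracks(joined)
```

Composing the join costs no round trip, and re-evaluating the same descriptor after a later insertion yields the same tracks, because the leaves carry the original timestamp.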

Notice that all descriptors (compound or otherwise) 
contain information about the invoked function and the 
arguments, which together can be used to construct a 
track collection. In this sense they can be viewed as a 
closure [18] or as a specialized form of a logical view 
from the database literature [9]. 

Our use of timestamped descriptors is a tradeoff be- 
tween efficiency and freshness. Timestamps imply that 
the application sees data as it existed in the database at a 
particular point in time, not necessarily the latest data. It 
allows the StarTrack server to cache the contents of the 
database in an in-memory data structure, or discard it at 
will and reevaluate it later, while providing easy to un- 
derstand and consistent semantics to the application. It 
also allows a client to present the descriptor to a differ- 
ent StarTrack server if needed for load-balancing reasons 
or if the original server crashes. Re-evaluating a descrip- 
tor is guaranteed to yield the same result anywhere in the 
system because the operations are deterministic, and the 
timestamp acts as a snapshot of the database (provided 
that tracks are not deleted from the system). If freshness 
is more important for an application, it can recreate the 
track collection as often as needed. 

The evaluation of a descriptor yields different types 
of in-memory data structures. For example, the evalua- 
tion of a descriptor constructed by GetSimilarTracks may 
(but need not) create a data structure called a track tree. 
A descriptor created by GetPassByTracks can result in a 
quad-tree [10]. The results of evaluating other descrip- 
tors are typically stored as a simple set of tracks. 


3.3 Track Tree


In our experience, when two tracks overlap, they usually 
do so on one or very few contiguous segments. We ex- 
ploit this property to build a hierarchical data structure 
called a track tree, which is used to speed up the retrieval 
of similar tracks. 

Each road segment is represented as a leaf node in a
track tree. For each leaf node, the track tree records all
tracks that contain that particular segment. Once all the
segments in a track collection are stored as leaf nodes,
pairs of nodes that refer to geographically adjacent
segments are considered for merging to form interior nodes
of the tree. Whenever there is a choice of pairs of nodes to
merge, the pair that has the highest number of tracks in
common is picked. This process is continued iteratively
up the tree. When merging two nodes, the tracks that belong
to both child nodes are recorded in the parent node
as well. By this construction, each node in the track tree
represents a contiguous sequence of road segments. In
addition, the sequence it represents is more likely to be
shared by multiple tracks.


Figure 2: The track tree of the set of four tracks shown
in Figure 1(a). Each node, except for leaf nodes, is
annotated with the set of tracks that contain it.
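
The greedy construction can be sketched in Python as follows. This is an illustrative reading of the description, not the authors' code: adjacency is approximated by segments that follow one another in some track, ties are broken by taking the first best pair found, and nodes are plain (segments, tracks) tuples. The four sample tracks follow the description of Figure 1(a) given in the text.

```python
def build_track_tree(tracks):
    """tracks: dict name -> list of segment ids in travel order.
    Returns every node as (segment_sequence, tracks_containing_it)."""
    leaf_tracks = {}
    adjacent = set()
    for name, segs in tracks.items():
        for s in segs:
            leaf_tracks.setdefault(s, set()).add(name)
        for a, b in zip(segs, segs[1:]):
            adjacent.add((a, b))          # assumed adjacency: consecutive in a track

    roots = [((s,), frozenset(t)) for s, t in sorted(leaf_tracks.items())]
    nodes = list(roots)
    while True:
        best = None
        for left in roots:
            for right in roots:
                if left is right:
                    continue
                if (left[0][-1], right[0][0]) not in adjacent:
                    continue
                common = left[1] & right[1]
                # Prefer the pair sharing the largest number of tracks.
                if common and (best is None or len(common) > len(best[2])):
                    best = (left, right, common)
        if best is None:
            return nodes
        left, right, common = best
        parent = (left[0] + right[0], common)
        roots = [r for r in roots if r is not left and r is not right]
        roots.append(parent)
        nodes.append(parent)


tracks = {
    "A": ["S1", "S2", "S3", "S4", "S5"],
    "B": ["S1", "S2", "S3", "S4", "S5"],
    "C": ["S1", "S2", "S3", "S4", "S6", "S7"],
    "D": ["S1", "S2", "S3", "S8", "S9"],
}
tree = dict(build_track_tree(tracks))
```

On this input the sketch produces interior nodes for S1 to S3, S1 to S4, S1 to S5, S6 to S7, and S8 to S9, matching the nodes named in the discussion of Figure 2.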

Figure 2 shows the track tree for the sample four tracks
in Figure 1(a). As shown in Figure 1(a), tracks A and
B are identical and consist of segments S1, S2, S3, S4,
and S5. Tracks C and D share common segments with
A and B. Segments shared by larger numbers of tracks
are favored when merging nodes, which explains why
segments S1 to S3 are merged together, instead of other
combinations, such as S2 to S4. Using this tree, tracks
A and B can be described by one single node (S1-5),
and tracks C and D can be described by two nodes each:
Track C by S1-4 and S6-7 and Track D by S1-3 and S8-9.

Track trees are used to accelerate several API oper- 
ations. In GetCommonSegments, after we identify the 
road segments shared by sufficiently many tracks, as in- 
dicated by the given threshold, we use a track tree to or- 
ganize them into a small number of contiguous tracks. 
This is done by merging up in the tree those nodes corre- 
sponding to these road segments. Given the way a track 
tree is constructed, this usually results in a small number 
of nodes, corresponding to a small number of contiguous 
tracks. 

Another API operation enabled by a track tree is Get- 
SimilarTracks. Implementing this function as a database 
operation is inefficient because there is little match be- 
tween our similarity semantics and the primitives sup- 
ported by spatial databases. 

With a track tree, StarTrack can quickly find a set of
tracks with a given degree of similarity to a specific track
T (see Code Segment 3.1). First, StarTrack identifies
the set of all nodes (interior and leaf) covered by T. In
order to do this, T is initially broken into smaller
segments. StarTrack then identifies the leaf nodes in the
track tree that correspond to these segments. Next, it
identifies pairs of adjacent nodes that have a common
parent node, includes the parent into the set, and iterates
until no such parent exists. These steps are encapsulated
in the function Map.
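
A possible shape for Map is sketched below in Python. The tree is given as a hand-written dictionary from interior node to its two children; this binary layout, including the intermediate node S1-2, is an assumption about Figure 2 rather than something the text states. The sketch also assumes every segment of the query track already has a leaf in the tree.

```python
def map_track(segments, children):
    """Return every tree node covered by a track.
    children: interior node -> (left child, right child)."""
    covered = set(segments)            # leaf nodes for the track's segments
    changed = True
    while changed:
        changed = False
        for parent, (left, right) in children.items():
            # A parent is covered once both of its children are covered.
            if parent not in covered and left in covered and right in covered:
                covered.add(parent)
                changed = True
    return covered


children = {
    "S1-2": ("S1", "S2"), "S1-3": ("S1-2", "S3"), "S1-4": ("S1-3", "S4"),
    "S1-5": ("S1-4", "S5"), "S6-7": ("S6", "S7"), "S8-9": ("S8", "S9"),
}
track_c = ["S1", "S2", "S3", "S4", "S6", "S7"]
nodes = map_track(track_c, children)
```

For track C the covered set includes S1-4 and S6-7 but not S1-5, in line with the description of Figure 2.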


The GetSimilarTracks operator then sorts the nodes in 
T by decreasing order of length. It sequentially scans 
each node, examining the set of tracks containing it, and 
outputs tracks that are at least simThresh similar to 
the query track. This process stops when it has found 
sufficiently many tracks as defined by the maxCount 
parameter, or when it has examined sufficiently many 
tracks. Recall that the client supplies the simThresh 
parameter (as part of the GetSimilarTracks call), as well 
as the maxCount parameter (as part of the GetTracks 
invocation, which triggers the evaluation of the descrip- 
tor). This process will not produce any false positives 
(i.e., tracks that purport to be similar but are not), but it 
could miss some highly similar tracks. The percentage of 
such misses is quite small when the similarity threshold 
is reasonably high, as our experimental results show (see 
Figure 7(c) in Section 6). 


Code segment 3.1 Pseudo-code for implementing
GetSimilarTracks using a track tree.

Track[] GetSimilarTracks(TrackTree trackTree,
    Track T, double simThresh, int maxCount)
{
    TrackTreeNode[] nodes = trackTree.Map(T);
    SortByDescLength(nodes);
    int examined = 0;
    SortedList<Track> results;
    foreach (node in nodes) {
        foreach (candidate in node.tracks) {
            if (T.Similarity(candidate) >= simThresh)
                results.Add(candidate);
            examined++;
            if ((results.Count >= maxCount) ||
                (examined >= 6*maxCount))
                return results;
        }
    }
    return results;
}





Similar to other in-memory data structures in StarTrack,
a track tree is cached in memory until evicted
under the caching policy (LRU in our implementation).
Since track collections are immutable, we do not update
data structures during their lifetime. However, the track
tree structure allows for efficient insertion of new tracks,
and whenever a track collection is created by building
upon an existing track collection, an existing underlying
track tree may be copied and updated.


4 Storage Platform Design


As previously described, we build and maintain in- 
memory data-structures at the StarTrack servers, and use 
a different set of database servers to store data persis- 
tently. StarTrack always checks if tracks can be found in 
the in-memory data-structures before fetching them from 
the database. 

StarTrack uses Microsoft’s SQL Server 2008, which 
supports the notion of geospatial objects as a funda- 
mental data type. Data is partitioned across multiple 
machines, and partitions are replicated using chained 
declustering [12], which provides the necessary scaling 
properties as well as automatic dynamic load-balancing 
and fault-tolerance. 


4.1 Database Tables 


The principal on-disk data structure consists of five tables
stored in SQL Server.


User Table. This consists of a set of records for each 
user containing a unique system-assigned user identifier 
and other personal information. 


Track Table. Every track is assigned a unique identi- 
fier, consists of a set of time-stamped latitude and lon- 
gitude coordinates, and is stored in a single row in the 
table. Both the raw and the canonical versions of tracks 
are stored in the same table. 


Representative Track Table. This table maintains a set 
of representative tracks per user and allows StarTrack to 
often avoid searching the larger Track Table. Each record 
stores information related to a single representative track: 
the canonical coordinates, the owner, and a count of how 
many actual instances of this representative track exist in 
the Track Table. Upon insertion of a track into the Track 
Table, StarTrack checks if there exists a representative 
track that matches the new track. If so, the new track 
is not inserted into the Representative Track Table, but 
the count of the matching representative track is incre- 
mented. The count serves as indication of the popularity 
of a given representative track and is used by StarTrack 
operations for ranking purposes. 

Two tracks are considered as matching if their start 
points are within 100 m of each other, if their end points 
are within 100 m of each other, and if the tracks are at 
least 90% similar. The choice of these parameters is fixed 
by the infrastructure and cannot be changed by individual 
applications. It is based on expected errors in GPS mea- 
surements, as well as cost/benefit tradeoffs, and is not as 
Procrustean as one might imagine. The values chosen
determine the size of the Representative Track Table:
high start/end point buffer values and low track
similarity values result in a smaller table of unique tracks, but
applications may lose the ability to discriminate between
tracks. The size of the table, in turn, affects the speed of
many functions in the API that must access that table.
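
The matching rule and the count update on insertion can be sketched in Python. The thresholds (100 m, 90%) come from the text; everything else is an assumption for illustration: tracks are lists of canonical (lat, lon) points, distance is computed with the haversine formula, and similarity is approximated by shared points over the union rather than by segment lengths.

```python
import math


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h))


def point_similarity(a, b):
    """Shared canonical points over the union (stand-in for segment lengths)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


def matches(a, b, buffer_m=100.0, min_sim=0.9):
    return (haversine_m(a[0], b[0]) <= buffer_m
            and haversine_m(a[-1], b[-1]) <= buffer_m
            and point_similarity(a, b) >= min_sim)


def insert_track(representatives, track):
    """representatives: list of [track, count] entries, updated in place."""
    for entry in representatives:
        if matches(entry[0], track):
            entry[1] += 1             # same route again: bump its popularity
            return representatives
    representatives.append([track, 1])
    return representatives


commute = [(37.35, -121.95), (37.36, -121.94), (37.37, -121.93)]
other = [(37.40, -122.00), (37.41, -122.01)]
reps = []
for t in (commute, list(commute), other):
    insert_track(reps, t)
```

The repeated commute increments the count of its representative instead of adding a row, while the unrelated track gets its own entry.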


Coordinate Table. During the map matching process, 
the set of coordinates in a path is drawn from a finite list 
of points, which depends on the particulars of the map 
data used for canonicalization of tracks. Each record in 
this table maps a location identifier to a pair of coordi- 
nates. This particular table is immutable, replicated on 
each database server, and not partitioned. 


Coordinate to Track Table. This table maps coordi- 
nates to tracks that go through them. We use it to speed 
up the location of tracks that pass through certain geo- 
graphic boundaries. 


StarTrack allows three types of criteria in fetching 
tracks from the database: user, time, and geographic 
region. Region-based queries may be performed by 
leveraging the geospatial functions provided by modern 
database systems, which support specialized indexing 
schemes. Such systems must be used with care because 
costs are still significant when indexing large numbers of 
complex geospatial objects such as tracks. 

In the original StarTrack implementation, we used the 
geospatial primitives of the database to treat each track as 
a separate object and created a geospatial index over all 
such objects. Now, we maintain a geospatial index on 
the Coordinate Table alone, thereby reducing the number 
of objects on which the geospatial index is maintained. 
We use this index to find all locations that match a given 
geographic query. We then use the Coordinate To Track 
Table to look up all tracks that go through these locations. 
This is feasible precisely because of the canonicalization 
pre-processing step. 
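
The two-step lookup can be illustrated with a small Python sketch. Plain dictionaries stand in for the Coordinate Table and the Coordinate to Track Table, and a linear scan stands in for the geospatial index; the table layouts and identifiers are assumptions.

```python
def tracks_in_region(region, coordinate_table, coord_to_tracks):
    """Two-step lookup: region -> location ids -> track ids."""
    min_lat, min_lon, max_lat, max_lon = region
    # Step 1: stands in for the geospatial index on the Coordinate Table.
    hits = [loc for loc, (lat, lon) in coordinate_table.items()
            if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon]
    # Step 2: Coordinate to Track Table lookup.
    found = set()
    for loc in hits:
        found |= coord_to_tracks.get(loc, set())
    return found


coordinate_table = {1: (37.30, -121.90), 2: (37.31, -121.91), 3: (37.50, -122.10)}
coord_to_tracks = {1: {"t1", "t2"}, 2: {"t2"}, 3: {"t3"}}
region = (37.29, -121.92, 37.32, -121.89)
found = tracks_in_region(region, coordinate_table, coord_to_tracks)
```

The spatial test touches only the (few) canonical locations; tracks are reached through an ordinary key lookup.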

The Coordinate Table and its geospatial index are
maintained by the database server and portions of them
may be cached in memory. We present a comparison
of the original and new approaches in Section 6.2
(Figure 3). If necessary, we can further speed up our design
by keeping the Coordinate Table in memory, indexed by an
in-memory quad-tree, instead of storing it in the database
server.


4.2 Database Server Organization 


The tables mentioned above are partitioned across multi- 
ple database servers. Based on StarTrack’s search crite- 
ria options, we considered two partitioning schemes: by 
geography and by user identifier. 

We decided not to partition by geography, since over
time it would lead to increasing numbers of tracks
that span geographic regions and would therefore have
to span servers.

We opted for partitioning data by user identifier, keeping
all data referring to a single user in a single database
server. This organization allows user-constrained queries
to be sent to a single database server, while requiring
geographic queries to be sent to all database servers.
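
The routing consequence of this choice can be sketched in Python. The modulo placement rule is an assumption; the text only states that data is partitioned by user identifier.

```python
def partition_for(user_id, num_servers):
    # Assumed placement rule; the text only says data is partitioned by user id.
    return user_id % num_servers


def servers_for_query(num_servers, user_id=None):
    """User-constrained queries hit one server; geographic ones hit all."""
    if user_id is not None:
        return [partition_for(user_id, num_servers)]
    return list(range(num_servers))
```

For example, with four servers a query constrained to user 10 goes to a single server, whereas a purely geographic query fans out to all four.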

Data is mirrored in the system. Each database server 
acts as the primary for one partition of each table, and 
as the mirror (or secondary) for its neighbors’ partitions. 
A primary database server processes read and write re- 
quests from clients, while a mirror server only handles 
read requests. 

StarTrack servers are clients of the database servers, 
and evenly distribute reads amongst the replicas. When 
a database server fails, the server that mirrors the parti- 
tions on the failed server takes over as primary for the 
partitions. The StarTrack servers direct write traffic to 
the new servers and in addition, distribute the read re- 
quests uniformly among all the replicas using chained 
declustering, as described by others [12, 19]. 
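
A simplified view of primary and mirror routing under failure is sketched below in Python. The layout (partition i has its primary on server i and its mirror on the next server) is an assumption in the spirit of chained declustering; the load rebalancing along the chain that full chained declustering performs is not modeled.

```python
def replicas(partition, num_servers):
    """Assumed layout: primary on server i, mirror on its right neighbour."""
    return partition, (partition + 1) % num_servers


def route(partition, num_servers, failed=frozenset()):
    """Return (write_target, read_targets) for a partition."""
    primary, mirror = replicas(partition, num_servers)
    alive = [s for s in (primary, mirror) if s not in failed]
    if not alive:
        raise RuntimeError("both replicas of the partition are down")
    # The first live replica acts as primary; reads may use any live replica.
    return alive[0], alive
```

When the primary of a partition fails, the mirror becomes the write target and the only read target for that partition.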


5 Applications 


We explored scenarios where a single user’s data can be 
used to personalize her experience based on her habit- 
ual tracks, for applications such as personalized adver- 
tising, recommendation systems, and health monitoring. 
On the other end of the spectrum, social applications, 
where the set of tracks from a group of friends or even a 
broader community are used, may help provide enhanced 
services to users. Examples include those related to ur- 
ban sensing, collaboration, discovery of new areas, and 
shared experiences. 

To illustrate the usefulness and evaluate the perfor- 
mance of StarTrack services, we describe two of the ap- 
plications we built. 

While both applications were non-trivial to write, the 
use of our API significantly simplified their construction. 
In fact, the application logic in both examples is suc- 
cinctly captured in a few code snippets. Our general ex- 
perience is that StarTrack provides an intuitive, flexible, 
and efficient way to program track-based applications. 


5.1 Ride-Sharing Service 


Ride-sharing has long held the promise of reducing en- 
ergy consumption. Transit departments in many major 
metropolitan areas now offer on-line ride-sharing ser- 
vices or portals (see for example, King County Metro 
Ride [15]). One challenge in building an effective ride- 
sharing service is to discover ride-share partners who 
travel on similar routes. 

With StarTrack, these ride-matching services are easily
built. The service can build a TrackCollection for
the employees of the same company, for a person’s
social network, or for a group of people who have
subscribed to a transit service. Code Segment 5.1 constructs
a track collection for a community of users. Code
Segment 5.2 identifies potential ride-sharing partners based
on the similarity of their travel patterns.


Code segment 5.1 Set up a community’s regularly traversed
tracks where the community is defined through
supplied SearchCriteria.

TrackCollxn getCommunityTracks(SearchCriteria sc,
    int count)
{
    TrackCollxn tc = MakeCollection(sc, true);
    return Take(SortTracks(tc, FREQ), count);
}





Code segment 5.2 Find ride-share candidates with similar
travel patterns. findOwners is a client-side function
that takes a set of tracks and returns the list of users
who own them.

List getRideShareCandidates
    (TrackCollxn communityTC, string username)
{
    UserCriteria uc = new UserCriteria();
    uc.Username = username;
    TrackCollxn userTC =
        MakeCollection(uc, true);
    Track[] popularTracks =
        GetTracks(SortTracks(userTC, FREQ),
            0, 10);
    List<TrackCollxn> similarTC;
    foreach (Track track in popularTracks) {
        TrackCollxn tc = GetSimilarTracks(
            communityTC, track, 0.7);
        similarTC.Add(tc);
    }
    Track[] similarTracks =
        GetTracks(JoinTrackCollections(similarTC),
            0, 100);
    return findOwners(similarTracks);
}





Another usage scenario is when a user needs a ride
between two specific locations. This can be done easily
by calling GetPassByTracks.

It is important to note that the ride-sharing service 
based on StarTrack offers more flexibility than conven- 
tional services. For instance, since a rider’s entire route 
is known, rather than just his start and destination, it al- 
lows the service more latitude in arranging pick-ups and 
drop-offs along the route. 


5.2 Personalized Driving Directions 


Current navigation systems and online map services
provide detailed turn-by-turn driving directions. Because
StarTrack knows what routes a person has taken in the
past, as well as how recently and how frequently, an
application could easily use StarTrack to provide
personalized driving directions.

For example, instead of providing detailed turn-by- 
turn instructions on how to get to the freeway from the 
person’s house, the directions might simply say “Get on 
Highway 101 heading south” and then provide detailed 
directions from that point. 


Code segment 5.3 Construct a user’s familiar segments.

TrackCollxn getFamiliarSegments(string username)
{
    UserCriteria uc = new UserCriteria();
    uc.Username = username;
    TrackCollxn uTC = MakeCollection(uc, true);
    // Pick the 10 most frequently occurring
    // tracks.
    TrackCollxn pplrTC =
        Take(SortTracks(uTC, FREQ), 10);
    TrackCollxn familiarTC =
        GetCommonSegments(pplrTC, 0.2);
    return familiarTC;
}





The application we built uses the Bing Map service 
and the StarTrack infrastructure. A user inputs start and 
destination locations, and the application uses Bing to 
get turn-by-turn directions for that route. Next, the appli- 
cation uses StarTrack to obtain the set of “familiar seg- 
ments” for that user, as shown in Code Segment 5.3. 

Having obtained the familiar segments for the user, the 
application identifies portions of the route returned by 
Bing that overlap with the familiar segments and uses 
the result to prepare personalized driving directions (we 
omit further description of these steps given that they are 
performed locally by the application and do not involve 
calls to StarTrack). 
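
One plausible form of this local step is sketched below in Python. The paper does not describe it, so the whole fragment is hypothetical: the route is assumed to be a list of (segment id, instruction) pairs, and each run of familiar segments is collapsed into a single summary instruction whose wording is invented here.

```python
def personalize(steps, familiar, summary="Follow your usual route"):
    """steps: list of (segment_id, instruction) in driving order.
    Runs of steps on familiar segments collapse into one short instruction."""
    directions = []
    in_familiar_run = False
    for segment, instruction in steps:
        if segment in familiar:
            if not in_familiar_run:
                directions.append(summary)
                in_familiar_run = True
        else:
            directions.append(instruction)
            in_familiar_run = False
    return directions


steps = [
    ("s1", "Turn left on Elm St"),
    ("s2", "Turn right on Oak Ave"),
    ("s3", "Merge onto Highway 101 south"),
    ("s9", "Take exit 12"),
    ("s10", "Turn left on Main St"),
]
directions = personalize(steps, familiar={"s1", "s2"})
```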


6 Evaluation 


This section evaluates the performance of the StarTrack 
service. To study the system at scale, we used synthet- 
ically generated tracks. We also ran experiments with 
actual tracks collected by users of GPS-equipped mo- 
bile devices, but omit the results since they are similar 
to those performed with synthetic tracks, and given that 
we only have a limited number of real tracks. 

We focus on the costs of executing track operations 
that involve (a) geographic constraints and (b) compar- 
isons of tracks. These operations are the most difficult to 
build efficiently, and are also among the most commonly 
occurring in the track-based applications that we built. 
We also report on the performance of two applications. 

Our experiments were all conducted on 2.6 GHz AMD
Opteron quad-core processors with 16 GB memory,
running Windows Server 2003.


6.1 Synthetic Tracks 


We generated synthetic tracks based on the salient fea- 
tures observed in a dataset of approximately 16,000 real 
tracks followed by 252 users over 2-week periods in 
Seattle, WA [16]. In our model, each person has fixed 
locations for home and workplace, and a number of “er- 
rand” locations that represent places they go less fre- 
quently. On weekdays, a person travels between the 
assigned home and work locations during the common 
morning and evening commute hours. Sporadically on 
weekdays and more often on weekends, a person carries 
out a number of errands. 

After choosing the start and end locations for each trip, 
we calculate the shortest path as well as its duration be- 
tween these points on a graph of road networks. We then 
sample and perturb each path to simulate noise in the 
sampling and localization of the data and treat the result- 
ing points as a track. 
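
The sample-and-perturb step can be sketched in Python as follows. The shortest-path computation is omitted, and the noise model (independent, bounded uniform noise per coordinate) and the parameter names are assumptions, since the text does not specify them.

```python
import random


def sample_and_perturb(path, every, noise_deg, rng):
    """Keep every `every`-th point of a path and add bounded uniform noise."""
    track = []
    for lat, lon in path[::every]:
        track.append((lat + rng.uniform(-noise_deg, noise_deg),
                      lon + rng.uniform(-noise_deg, noise_deg)))
    return track


rng = random.Random(0)   # fixed seed so the run is repeatable
path = [(37.0 + 0.001 * i, -122.0) for i in range(100)]
track = sample_and_perturb(path, every=5, noise_deg=0.0002, rng=rng)
```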

Our early experiments indicated that some features of 
tracks have a pronounced effect on performance while 
others do not. Specifically, performance is affected by 
the following: 


• Number of tracks. The larger the number of tracks,
  the greater the computational and storage overhead.

• Length of tracks. The number of points in a track
  has an impact on performance. Assuming tracks are
  canonicalized, the number of points is proportional
  to the length of the tracks.

• Covered region. The region over which the tracks
  are generated has an impact on track density (i.e.,
  number of tracks that pass through a unit area). As
  track density increases, the computational burden
  imposed on our algorithms increases. For example,
  the same geographic query returns more results and
  therefore incurs more computational cost when the
  density of tracks is higher.


We devised our model to allow us to control these key 
features. Our belief is that, at least for the purpose of per- 
formance evaluation, any model that allowed these fea- 
tures of tracks to be varied would be adequate. 

For our scalability experiments, we generated syn- 
thetic tracks for a 3-month period and 18,000 users in 
Santa Clara County. This resulted in a total of over 4.5 
million tracks. On average, each track is 20 km long 
and contains 400 GPS samples that yield on average 163 
points after canonicalization.


6.2 Performance of Geographic Queries


One of StarTrack’s most important operations is query- 
ing based on geographic constraints. Some of these op- 
erations require a round-trip to the database server, while 
others can be optimized by an in-memory cache. In our 
API, geographic queries show up in two forms. First, in 
MakeCollection an application can specify a geographic 
region constraint. Second, GetPassByTracks allows an 
application to select those tracks in a track collection 
that pass within specific areas. The first query involves 
retrieving tracks from a database, while the second in- 
volves retrieving tracks from a pre-computed track col- 
lection, which can be sped up in memory. 


Geographic queries to the database. Although we do 
not focus on studying the performance of the spatial fea- 
tures of the database, we investigate how best to use them 
to improve simple geographic queries used to pre-filter 
tracks brought into memory. 

We compared two ways to store tracks and construct 
the necessary indices. In the first approach, used in 
our original prototype, we treat each track as a sepa- 
rate geospatial object and create a spatial index over all 
tracks. This index is used to retrieve all the tracks inter- 
secting the query region. The second approach, used by 
StarTrack, involves the use of two additional tables, the 
Coordinate Table and the Coordinate to Track Table, as 
described in Section 4.1. In this approach, a spatial index 
is built only on the Coordinate Table. 
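The second approach can be illustrated with a minimal sketch. It assumes in-memory dictionaries standing in for the two tables; the names `build_tables` and `tracks_in_region` are hypothetical, and the linear scan stands in for the database's spatial index. Because canonicalized tracks share coordinates, the region test runs once per distinct coordinate rather than once per track point.

```python
from collections import defaultdict

def build_tables(tracks):
    """tracks: dict mapping track_id -> list of canonical (x, y) points.

    Returns two dictionaries that play the role of the Coordinate Table
    and the Coordinate to Track Table (layout assumed for illustration).
    """
    coordinate_table = {}               # (x, y) -> coordinate id
    coord_to_tracks = defaultdict(set)  # coordinate id -> set of track ids
    for track_id, points in tracks.items():
        for p in points:
            cid = coordinate_table.setdefault(p, len(coordinate_table))
            coord_to_tracks[cid].add(track_id)
    return coordinate_table, coord_to_tracks

def tracks_in_region(coordinate_table, coord_to_tracks, xmin, ymin, xmax, ymax):
    """Return ids of tracks with at least one coordinate inside the rectangle."""
    hits = set()
    # A real deployment would use a spatial index here instead of a scan.
    for (x, y), cid in coordinate_table.items():
        if xmin <= x <= xmax and ymin <= y <= ymax:
            hits |= coord_to_tracks[cid]
    return hits
```

Note that this sketch matches tracks by their stored points only; a track whose segment crosses the region without a stored point inside it would need an additional geometric test.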
Figure 3: Query time with and without the Coordinate 
Table when searching for tracks that intersect square 
regions of increasing side lengths. Secondary y-axis shows 
average number of tracks matched. 


Figure 3 presents the query time for both approaches 
when we vary the area of the query region on a set of 
100,000 tracks. It also shows the average number of 
matched tracks on the secondary y-axis. Isolating the 
need to execute geographic queries to a small set of 
distinct points through the use of the Coordinate Table 
leads to significant performance benefits. This 
enhancement is only possible due to track canonicalization. 

Figure 4: Memory usage and construction time of the 
quad-tree for different sizes of tracks. 


Geographic queries to in-memory data structure. Re- 
call from Section 3.2 that the evaluation of a GetPass- 
ByTracks operation triggers the construction of an in- 
memory quad-tree, in the expectation that the data will be 
repeatedly accessed in the future. Canonicalization tends 
to lower the number of unique coordinates in tracks, 
speeding up the construction time for quad-trees, as well 
as the execution time of subsequent requests against it. 
Figure 4 shows the cost of constructing a quad-tree. 
Building the quad-tree itself requires little space and time 
since the number of unique coordinates is small and lev- 
els off when the tracks cover a large region. Both the 
memory and time needed are linear in the number of 
tracks, and are mostly spent on building an index from 
coordinates to their containing tracks. 

Figure 5 presents the time to query a quad-tree with 
varying numbers of tracks and region sizes. In all cases, 
the query time is very low. For example, it takes about 1 
ms for a region with a 5 km radius on 100,000 tracks. The 
query time is fairly insensitive to the number of tracks 
because the structure of the quad-tree is determined by 
the unique coordinates. On the other hand, the size of 
the query region affects the times since it determines the 
number of quad-tree cells to be visited. 
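The following is a minimal sketch of the idea, not StarTrack's implementation: a point quad-tree is built over unique coordinates only, and a separate index maps each coordinate to the tracks containing it. The class and function names, the node capacity, and the square bounding cell are all assumptions made for illustration.

```python
class QuadTree:
    """Point quad-tree over unique coordinates inside a square cell."""
    CAPACITY = 4  # assumed leaf capacity before a cell splits

    def __init__(self, x, y, size):
        self.x, self.y, self.size = x, y, size  # lower-left corner, side length
        self.points = []
        self.children = None

    def _contains(self, p):
        return (self.x <= p[0] < self.x + self.size
                and self.y <= p[1] < self.y + self.size)

    def insert(self, p):
        """Insert a coordinate; callers must only insert distinct points."""
        if not self._contains(p):
            return False
        if self.children is None:
            if len(self.points) < self.CAPACITY:
                self.points.append(p)
                return True
            self._split()
        for child in self.children:
            if child.insert(p):
                return True
        return False

    def _split(self):
        h = self.size / 2
        self.children = [QuadTree(self.x + dx, self.y + dy, h)
                         for dx in (0, h) for dy in (0, h)]
        old, self.points = self.points, []
        for q in old:
            for child in self.children:
                if child.insert(q):
                    break

    def query(self, xmin, ymin, xmax, ymax, out=None):
        """Collect stored coordinates inside the query rectangle."""
        out = [] if out is None else out
        if (xmax < self.x or xmin >= self.x + self.size
                or ymax < self.y or ymin >= self.y + self.size):
            return out  # cell does not overlap the region: skip it
        out.extend(p for p in self.points
                   if xmin <= p[0] <= xmax and ymin <= p[1] <= ymax)
        for child in self.children or []:
            child.query(xmin, ymin, xmax, ymax, out)
        return out

def build_index(tracks, size):
    """Build the quad-tree plus the coordinate -> tracks index."""
    tree, coord_to_tracks = QuadTree(0, 0, size), {}
    for track_id, points in tracks.items():
        for p in points:
            if p not in coord_to_tracks:
                coord_to_tracks[p] = set()
                tree.insert(p)  # only unique coordinates enter the tree
            coord_to_tracks[p].add(track_id)
    return tree, coord_to_tracks

def pass_by(tree, coord_to_tracks, xmin, ymin, xmax, ymax):
    """Return ids of tracks with a coordinate inside the rectangle."""
    hits = set()
    for p in tree.query(xmin, ymin, xmax, ymax):
        hits |= coord_to_tracks[p]
    return hits
```

Since the tree holds each coordinate once, its shape depends on the number of unique coordinates rather than the number of tracks, which is consistent with the observation that query time is fairly insensitive to collection size.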


6.3 Performance of Track Comparisons 


A common query in track based applications is to retrieve 
tracks based on similarity. Typically, an application has 
a track collection and a “query” track and needs to find 
tracks in the set that are most similar to the query track. 
We compare the performance of our technique using a 
track tree to three alternative methods for ranking 
tracks based on similarity: (1) Bruteforce: The 
bruteforce method compares the query track against every 
track in the collection and returns those with similarity 
above a given threshold. For the bruteforce method, we 
assume all tracks are already in memory. (2) In-memory 
filtering: This method constructs an in-memory dictionary 
used to quickly look up tracks that contain any given 
point. For a given query track, we use this dictionary to 
identify all tracks that intersect it, after which we 
compute the similarity of each intersecting track to the 
query track, returning those above the threshold. (3) 
Database filtering: We store the set of tracks in the 
database, use a query to retrieve all tracks in the 
database that intersect the query track, and compute the 
similarity against the retrieved tracks. 

Figure 5: Time to query a quad-tree. (a) Query time for 
different numbers of tracks when size of the region is 
fixed to 1 and 5 km, respectively. (b) Query time for 50K 
and 100K tracks as the size of region is varied. 
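As an illustration of the in-memory filtering baseline (2), the sketch below builds a dictionary from points to track ids, gathers the tracks that share any point with the query, and scores only those candidates. The similarity function is an assumption: Jaccard similarity over point sets is used as a stand-in, and the function names are hypothetical.

```python
from collections import defaultdict

def jaccard(a, b):
    """Stand-in similarity: shared points divided by all points (assumed metric)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def build_point_index(tracks):
    """Dictionary from each point to the ids of tracks containing it."""
    index = defaultdict(set)
    for track_id, points in tracks.items():
        for p in points:
            index[p].add(track_id)
    return index

def similar_tracks(query, tracks, index, threshold):
    """Score only tracks that intersect the query; keep those at or above threshold."""
    candidates = set()
    for p in query:
        candidates |= index.get(p, set())
    scored = [(tid, jaccard(query, tracks[tid])) for tid in candidates]
    kept = [(tid, s) for tid, s in scored if s >= threshold]
    return sorted(kept, key=lambda ts: -ts[1])
```

The dictionary avoids scoring tracks that share nothing with the query, but it must hold an entry for every point of every track, which is the resource cost attributed to this method.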


We ran experiments with different numbers of tracks 
and queries with varying similarity thresholds. 


Figure 6 shows the query time when using the various 
methods. The query time with the track tree method 
is dependent on the similarity threshold, unlike with the 
other three alternatives. In Figure 6, we present results 
for the track tree approach when the similarity threshold 
is 0.7 and 0.9. The experiments show that track trees lead 
to significantly more efficient queries when compared to 
the bruteforce method, achieving two to three orders of 
magnitude speedups. Although the in-memory filtering 
method performs better than the bruteforce method, it is 
still significantly slower than the track tree method, 
while consuming high amounts of resources for 
constructing and storing the in-memory dictionary. The 
database filtering method presented the worst performance. 

Figure 6: Query time comparison between StarTrack 
and three alternative methods. For StarTrack, results are 
shown for similarity thresholds of 0.7 and 0.9. 


There is a cost associated with constructing a track tree 
that is at the heart of our technique. Figure 7(a) shows 
the memory usage and the time for constructing a track 
tree as a function of the number of tracks in the collec- 
tion. Constructing a track tree takes linear space and 
slightly super-linear time as the height of the track tree 
grows logarithmically with the number of tracks. There 
is a tradeoff in using a track tree: it takes time to 
construct, but once constructed, it leads to significantly 
faster queries. From Figures 7(a) and 6, we calculate 
the “break-even” point, or the minimum number 
of queries such that the amortized query time using a 
track tree is lower than the query time of the bruteforce 
method. These break-even numbers are shown in 
Figure 7(b). As observed, the numbers grow slowly with the 
number of tracks, and are fairly small: below 80 for a 
track collection with up to 100,000 tracks. 
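The break-even definition can be written out directly. With construction time C, track tree query time q_tree, and bruteforce query time q_brute, the amortized cost (C + n * q_tree) / n falls below q_brute once n > C / (q_brute - q_tree). The sketch below computes the smallest such n; the function name and the timings in the usage note are hypothetical and are not read from Figures 6 or 7.

```python
import math

def break_even_queries(build_time, tree_query_time, brute_query_time):
    """Smallest n with (build_time + n * tree_query_time) / n < brute_query_time.

    Returns None when the track tree is never faster per query.
    All times must use the same unit.
    """
    saving = brute_query_time - tree_query_time
    if saving <= 0:
        return None
    # n must be strictly greater than build_time / saving.
    return math.floor(build_time / saving) + 1
```

For example, with an assumed 7000 ms construction time, 1 ms per track tree query, and 100 ms per bruteforce query, `break_even_queries(7000, 1, 100)` gives 71 queries.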


One potential downside of the track tree approach is 
that while it is highly efficient at retrieving similar tracks 
and although it will never return tracks that do not sat- 
isfy the similarity threshold, it may not return all tracks 
above the given similarity threshold. Figure 7(c) shows 
the coverage of the track tree method. The graph shows 
the percentage of the expected tracks returned when us- 
ing a track tree. We can see that the coverage increases 
for higher similarity thresholds. It returns over 90% of 
the tracks when similarity is above 0.7. We believe this 
is sufficient for typical applications, which are only 
interested in tracks with reasonably high similarity. 

Figure 7: (a) Memory and processing time required for constructing a track tree. (b) Break-even number for use of a 
track tree. (c) Coverage of track tree approach as function of the similarity threshold (for 10K, 50K and 100K tracks). 


6.4 Application Performance 


We use the Ride-Sharing (RS) and Personalized Driving 
Directions (PDD) applications, presented in Section 5, 
to evaluate the overall performance of StarTrack. These 
two applications illustrate two different usage scenarios: 
RS creates a large track collection for repeated accesses 
while PDD creates many small per-user track collections. 

We fixed the number of database servers to three and 
varied the number of StarTrack servers. To generate load 
on the servers, we ran multiple instances of these appli- 
cations from a number of client machines. 


6.4.1 Single StarTrack Server Experiments 


The RS application identifies potential ride-sharing part- 
ners for a given user, and as presented in Code Seg- 
ment 5.2, involves multiple calls to the StarTrack server. 
In our evaluation, we built a track collection with 50,000 
unique tracks from which the application searches for 
similar tracks. We warmed up the server by construct- 
ing a track tree on the large set of tracks before sending 
it client requests. Figure 8(a) shows the response times 
for RS under varying request rates. Despite the more 
complex nature of the application, one StarTrack server 
is capable of satisfying 30 requests per second with a 
response time of around 150 ms. 

We ran experiments for the PDD application under two 
different types of load. In the first case, queries simulate 
users whose data has not been cached on the StarTrack 
server prior to the query. In the second case, we preload 
the cache with the in-memory data structures used to ex- 
pedite the GetCommonSegments operation (familiarTC 
in Code Segment 5.3) invoked by the application. 

Figures 8(b) and (c) plot the response times with 
varying request rates under the two types of loads. When 
the data is not cached, each server is capable of 
satisfying up to 30 requests per second without increasing 
the response time. The average response time prior to 
saturation is around 100 ms. The maximum server throughput 
increases to 270 requests per second and the response 
time falls to 60 ms when the data is previously cached on 
the server. 


6.4.2 Scalability Experiments 


For both applications, individual requests sent by the 
clients are entirely independent of one another. We tested 
StarTrack’s scalability by running the PDD application 
on multiple StarTrack servers. For this experiment we 
used the non-cached version of PDD, with the goal of 
exercising load on the database. 


In Figure 9 we present the maximum throughput that 
the system is able to achieve with a varying number of 
StarTrack servers. As expected, the system scales lin- 
early with the number of servers. Since PDD only re- 
trieves a small number of tracks for each user, this exper- 
iment did not saturate the database servers. 


From these experiments, we estimate the resources 
needed to satisfy a given number of users for our tested 
applications. Three StarTrack servers can support a peak 
load of around 120 requests per second (without caching) 
or up to 780 (with caching). Without caching, this allows 
over 5 million queries uniformly distributed over a period 
of 12 hours, corresponding to an average of 5 queries per 
user given a population of 1 million users requesting per- 
sonalized driving directions. 
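The uncached estimate above is plain arithmetic, reproduced here as a short check. The helper names are hypothetical; the inputs (120 requests per second, 12 hours, 1 million users) are the figures quoted in the text.

```python
SECONDS_PER_HOUR = 3600

def total_queries(peak_requests_per_second, hours):
    """Queries served if the peak rate is sustained uniformly over the period."""
    return peak_requests_per_second * hours * SECONDS_PER_HOUR

def queries_per_user(total, users):
    """Average number of queries available to each user."""
    return total / users

total = total_queries(120, 12)                 # three servers, no caching
per_user = queries_per_user(total, 1_000_000)  # one million users
```

This yields 5,184,000 queries, or roughly 5 per user, matching the "over 5 million" figure in the text.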


In the case of ride-sharing, it’s desirable that track 
trees are pre-built and kept in memory. In order to cre- 
ate and cache a single or multiple track trees with each 
user’s top 5 tracks, a ride-sharing application satisfying 
1 million users would require approximately 10 GB of 
memory. A single server holding all this data could allow 
a peak load of 35 requests per second, or more servers 
could be used if higher peak loads need to be handled. 
Figure 8: Response times for the RS and PDD applications under varying request rates. (a) RS application; (b) PDD 
where users’ tracks are not cached; (c) PDD where users’ tracks are previously cached. 

Figure 9: Maximum aggregate request rate with 
increasing numbers of StarTrack servers. 


7 Related Work 


As mobile devices have become equipped with the abil- 
ity to determine their own location, there has been an 
emergence of applications that collect and utilize users’ 
location data. The research community has proposed 
a number of useful location-based applications. Traffic 
prediction [11, 24], ride-sharing [14], personalized driv- 
ing directions [21] and electronic tour guides [1, 25] are 
some compelling examples. 

At present, every application is forced to maintain its 
own silo of user location data. StarTrack addresses this 
problem by providing a common infrastructure that col- 
lects location information and enables access to it by 
multiple applications. In recent years, a number of data 
platforms (such as Twitter and Facebook) have emerged 
that enable sharing of information between users. These 
platforms provide external application developers with 
an API for accessing user information. StarTrack can be 
thought of as a platform that stores and enables access to 
the tracks traversed by users in their daily lives. 

Efficient collection of location data is an important 
precursor to organizing this data and making it 
accessible. The CarTel project [13] is a distributed sensor 
network that supports data collection from mobile phones 
and vehicular sensor networks. CarTel allows 
applications to visualize traces stored in a relational 
database using spatial queries. 

Database researchers have extensively studied the 
problem of storing, indexing, and retrieving trajectories. 
A trajectory is similar to a track in our system and is 
modeled as a geometric object with 3 dimensions: two 
for geographical location and a third for time. Prior work 
has focused on range queries on trajectories and has led 
to novel indexing techniques. For example, research has 
shown that it is more efficient to separate the spatial and 
temporal dimensions and to first index the spatial dimen- 
sions [6]. There is also research that optimizes storage 
and query costs when trajectories are drawn from a fixed 
road network [2, 4, 22]. Some of the design decisions in 
StarTrack are based on similar observations. StarTrack 
additionally allows tracks with very similar geometries 
to be pruned, resulting in even greater savings. Fur- 
thermore, StarTrack exploits the repetitiveness in users’ 
tracks drawn from a road map to implement efficient sim- 
ilarity and common segment queries, which are not stud- 
ied in previous work. 


8 Conclusion 


StarTrack enables a broad class of track-based applica- 
tions, involving both individual users and social network- 
ing groups. Our original design of the StarTrack platform 
focused almost exclusively on the set of operations that 
would be useful to application developers and ignored 
performance and scalability considerations. Significant 
work went into revising the StarTrack design and imple- 
mentation to enhance its efficiency, robustness, scalabil- 
ity, and ease of use. In some cases, we were able to apply 
well-known techniques, such as vertical data partitioning 
and chained declustering. However, most of the observed 
improvements come from innovative data structures like 
track trees, new representations for canonicalized tracks, 
and novel uses of delayed execution and caching. 

The end result is a track-based service that shows sev- 
eral orders of magnitude improvement in performance 
for operations that are commonly used in the applications 
that we have developed. This allows such applications 
to meet their scalability requirements. Moving forward, 
we plan to build and deploy additional track-based ap- 
plications to further validate the practical utility of our 
redesigned service. 
Abstract 


In classical machine virtualization, a hypervisor runs 
multiple operating systems simultaneously, each on its 
own virtual machine. In nested virtualization, a hypervi- 
sor can run multiple other hypervisors with their associ- 
ated virtual machines. As operating systems gain hyper- 
visor functionality—Microsoft Windows 7 already runs 
Windows XP in a virtual machine—nested virtualization 
will become necessary in hypervisors that wish to host 
them. We present the design, implementation, analysis, 
and evaluation of high-performance nested virtualization 
on Intel x86-based systems. The Turtles project, which 
is part of the Linux/KVM hypervisor, runs multiple un- 
modified hypervisors (e.g., KVM and VMware) and op- 
erating systems (e.g., Linux and Windows). Despite the 
lack of architectural support for nested virtualization in 
the x86 architecture, it can achieve performance that is 
within 6-8% of single-level (non-nested) virtualization 
for common workloads, through multi-dimensional pag- 
ing for MMU virtualization and multi-level device as- 
signment for I/O virtualization. 


The scientist gave a superior smile before re- 
plying, “What is the tortoise standing on?” 
“You’re very clever, young man, very clever”, 
said the old lady. “But it’s turtles all the way 
down!” 


1 Introduction 


Commodity operating systems increasingly make use 
of virtualization capabilities in the hardware on which 
they run. Microsoft’s newest operating system, Win- 
dows 7, supports a backward compatible Windows XP 
mode by running the XP operating system as a virtual 
machine. Linux has built-in hypervisor functionality 
via the KVM [29] hypervisor. As commodity operating 
systems gain virtualization functionality, nested 
virtualization will be required to run those operating 
systems/hypervisors themselves as virtual machines. 

Nested virtualization has many other potential uses. 
Platforms with hypervisors embedded in firmware [1,20] 
need to support any workload and specifically other hy- 
pervisors as guest virtual machines. An Infrastructure- 
as-a-Service (IaaS) provider could give a user the ability 
to run a user-controlled hypervisor as a virtual machine. 
This way the cloud user could manage his own virtual 
machines directly with his favorite hypervisor of choice, 
and the cloud provider could attract users who would like 
to run their own hypervisors. Nested virtualization could 
also enable the live migration [14] of hypervisors and 
their guest virtual machines as a single entity for any 
reason, such as load balancing or disaster recovery. It 
also enables new approaches to computer security, such 
as honeypots capable of running hypervisor-level root- 
kits [43], hypervisor-level rootkit protection [39,44], and 
hypervisor-level intrusion detection [18, 25]—for both 
hypervisors and operating systems. Finally, it could also 
be used for testing, demonstrating, benchmarking and 
debugging hypervisors and virtualization setups. 

The anticipated inclusion of nested virtualization in 
x86 operating systems and hypervisors raises many in- 
teresting questions, but chief amongst them is its runtime 
performance cost. Can it be made efficient enough that 
the overhead doesn’t matter? We show that despite the 
lack of architectural support for nested virtualization in 
the x86 architecture, efficient nested x86 virtualization— 
with as little as 6-8% overhead—is feasible even when 
running unmodified binary-only hypervisors executing 
non-trivial workloads. 

Because of the lack of architectural support for nested 
virtualization, an x86 guest hypervisor cannot use the 
hardware virtualization support directly to run its own 
guests. Fundamentally, our approach for nested 
virtualization multiplexes multiple levels of virtualization 
(multiple hypervisors) on the single level of architectural 
support available. We address each of the following 
areas: CPU (e.g., instruction-set) virtualization, memory 
(MMU) virtualization, and I/O virtualization. 

x86 virtualization follows the “trap and emulate” 
model [21,22,36]. Since every trap by a guest hypervisor 
or operating system results in a trap to the lowest (most 
privileged) hypervisor, our approach for CPU virtualiza- 
tion works by having the lowest hypervisor inspect the 
trap and forward it to the hypervisors above it for emula- 
tion. We implement a number of optimizations to make 
world switches between different levels of the virtualiza- 
tion stack more efficient. For efficient memory virtual- 
ization, we developed multi-dimensional paging, which 
collapses the different memory translation tables into the 
one or two tables provided by the MMU [13]. For effi- 
cient I/O virtualization, we bypass multiple levels of hy- 
pervisor I/O stacks to provide nested guests with direct 
assignment of I/O devices [11, 31, 37,52, 53] via multi- 
level device assignment. 

Our main contributions in this work are: 


• The design and implementation of nested 
virtualization for Intel x86-based systems. This 
implementation can run unmodified hypervisors such as 
KVM and VMware as guest hypervisors, and can 
run multiple operating systems such as Linux and 
Windows as nested virtual machines. Using 
multi-dimensional paging and multi-level device 
assignment, it can run common workloads with overhead 
as low as 6-8% of single-level virtualization. 


• The first evaluation and analysis of nested x86 
virtualization performance, identifying the main causes 
of the virtualization overhead, and classifying them 
into guest hypervisor issues and limitations in the 
architectural virtualization support. We also 
suggest architectural and software-only changes which 
could reduce the overhead of nested x86 
virtualization even further. 


2 Related Work 


Nested virtualization was first mentioned and theoreti- 
cally analyzed by Popek and Goldberg [21, 22,36]. Bel- 
paire and Hsu extended this analysis and created a formal 
model [10]. Lauer and Wyeth [30] removed the need for 
a central supervisor and based nested virtualization on 
the ability to create nested virtual memories. Their im- 
plementation required hardware mechanisms and corre- 
sponding software support, which bear little resemblance 
to today’s x86 architecture and operating systems. 
Belpaire and Hsu also presented an alternative ap- 
proach for nested virtualization [9]. In contrast to today’s 
x86 architecture which has a single level of architectural 
support for virtualization, they proposed a hardware ar- 
chitecture with multiple virtualization levels. 

The IBM z/VM hypervisor [35] included the first prac- 
tical implementation of nested virtualization, by making 
use of multiple levels of architectural support. Nested 
virtualization was also implemented by Ford et al. in a 
microkernel setting [16] by modifying the software stack 
at all levels. Their goal was to enhance OS modularity, 
flexibility, and extensibility, rather than run unmodified 
hypervisors and their guests. 

During the last decade software virtualization 
technologies for x86 systems rapidly emerged and were 
widely adopted by the market, causing both AMD and 
Intel to add virtualization extensions to their x86 
platforms (AMD SVM [4] and Intel VMX [48]). KVM [29] 
was the first x86 hypervisor to support nested 
virtualization. Concurrent with this work, Alexander Graf and 
Joerg Roedel implemented nested support for AMD 
processors in KVM [23]. Despite the differences between 
VMX and SVM (VMX takes approximately twice as 
many lines of code to implement), nested SVM shares 
many of the same underlying principles as the Turtles 
project. Multi-dimensional paging was also added to 
nested SVM based on our work, but multi-level device 
assignment is not implemented. 

There was also a recent effort to incorporate nested 
virtualization into the Xen hypervisor [24], which again 
appears to share many of the same underlying principles 
as our work. It is, however, at an early stage: it can only 
run a single nested guest on a single CPU, does not have 
multi-dimensional paging or multi-level device assign- 
ment, and no performance results have been published. 

Blue Pill [43] is a root-kit based on hardware virtual- 
ization extensions. It is loaded during boot time by in- 
fecting the disk master boot record. It emulates VMX 
in order to remain functional and avoid detection when a 
hypervisor is installed in the system. Blue Pill’s nested 
virtualization support is minimal since it only needs to 
remain undetectable [17]. In contrast, a hypervisor with 
nested virtualization support must efficiently multiplex 
the hardware across multiple levels of virtualization deal- 
ing with all of CPU, MMU, and I/O issues. Unfortu- 
nately, according to its creators, Blue Pill’s nested VMX 
implementation can not be published. 

ScaleMP vSMP is a commercial product which 
aggregates multiple x86 systems into a single SMP virtual 
machine. ScaleMP recently announced a new “VM on VM” 
feature which allows running a hypervisor on top of their 
underlying hypervisor. No details have been published 
on the implementation. 

Berghmans demonstrates another approach to nested 
x86 virtualization, where a software-only hypervisor is 
run on a hardware-assisted hypervisor [12]. In contrast, 
our approach allows both hypervisors to take advantage 
of the virtualization hardware, leading to a more efficient 
implementation. 


3 Turtles: Design and Implementation 


The IBM Turtles nested virtualization project imple- 
ments nested virtualization for Intel’s virtualization tech- 
nology based on the KVM [29] hypervisor. It can host 
multiple guest hypervisors simultaneously, each with its 
own multiple nested guest operating systems. We have 
tested it with unmodified KVM and VMware Server as 
guest hypervisors, and unmodified Linux and Windows 
as nested guest virtual machines. Since we treat nested 
hypervisors and virtual machines as unmodified black 
boxes, the Turtles project should also run any other x86 
hypervisor and operating system. 

The Turtles project is fairly mature: it has been tested 
running multiple hypervisors simultaneously, supports 
SMP, and takes advantage of two-dimensional page table 
hardware where available in order to implement nested 
MMU virtualization via multi-dimensional paging. It 
also makes use of multi-level device assignment for effi- 
cient nested I/O virtualization. 


3.1 Theory of Operation 


There are two possible models for nested virtualization, 
which differ in the amount of support provided by the 
underlying architecture. In the first model, multi-level 
architectural support for nested virtualization, each hy- 
pervisor handles all traps caused by sensitive instructions 
of any guest hypervisor running directly on top of it. This 
model is implemented for example in the IBM System z 
architecture [35]. 

The second model, single-level architectural support 
for nested virtualization, has only a single hypervisor 
mode, and a trap at any nesting level is handled by this 
hypervisor. As illustrated in Figure 1, regardless of the 
level in which a trap occurred, execution returns to the 
level 0 trap handler. Therefore, any trap occurring at 
any level from 1 ... n causes execution to drop to level 
0. This limited model is implemented by both Intel and 
AMD in their respective x86 virtualization extensions, 
VMX [48] and SVM [4]. 

Since the Intel x86 architecture is a single-level 
virtualization architecture, only a single hypervisor can 
use the processor’s VMX instructions to run its guests. 
For unmodified guest hypervisors to use VMX 
instructions, this single bare-metal hypervisor, which we call 
L0, needs to emulate VMX. This emulation of VMX can 
work recursively. Given that L0 provides a faithful 
emulation of the VMX hardware any time there is a trap 
on VMX instructions, the guest running on L1 will not 
know it is not running directly on the hardware. 
Building on this infrastructure, the guest at L1 is itself able 
to use the same techniques to emulate the VMX hardware 
to an L2 hypervisor which can then run its L3 guests. 
More generally, given that the guest at L_{n-1} provides a 
faithful emulation of VMX to guests at L_n, a guest at L_n 
can use the exact same techniques to emulate VMX for a 
guest at L_{n+1}. We thus limit our discussion below to L0, 
L1, and L2. 

Figure 1: Nested traps with single-level architectural 
support for virtualization 

Fundamentally, our approach for nested virtualization 
works by multiplexing multiple levels of virtualization 
(multiple hypervisors) on the single level of architectural 
support for virtualization, as can be seen in Figure 2. 
Traps are forwarded by L0 between the different levels. 

Figure 2: Multiplexing multiple levels of virtualization 
on a single hardware-provided level of support 


When L1 wishes to run a virtual machine, it launches it 
via the standard architectural mechanism. This causes a 
trap, since L1 is not running in the highest privilege level 
(as is L0). To run the virtual machine, L1 supplies a 
specification of the virtual machine to be launched, which 
includes properties such as its initial instruction pointer 
and its page table root. This specification must be 
translated by L0 into a specification that can be used to run 
L2 directly on the bare metal, e.g., by converting 
memory addresses from L1’s physical address space to L0’s 
physical address space. Thus L0 multiplexes the 
hardware between L1 and L2, both of which end up running 
as L0 virtual machines. 
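A toy sketch of the address conversion step follows. It assumes 4 KB pages and a dictionary mapping L1 physical page frames to L0 physical page frames; the field names (`guest_rip`, `page_table_root`) and helper names are hypothetical and do not reflect the actual VMCS layout or the Turtles code.

```python
PAGE_SHIFT = 12                      # assumed 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1

def l1_phys_to_l0_phys(addr, frame_map):
    """Translate an L1 physical address using an L1-frame -> L0-frame map."""
    frame, offset = addr >> PAGE_SHIFT, addr & PAGE_MASK
    return (frame_map[frame] << PAGE_SHIFT) | offset

def translate_spec(l1_spec, frame_map, physical_fields):
    """Copy L1's specification, rewriting fields that hold L1 physical addresses."""
    l0_spec = dict(l1_spec)
    for name in physical_fields:
        l0_spec[name] = l1_phys_to_l0_phys(l1_spec[name], frame_map)
    return l0_spec
```

Only fields that hold physical addresses are rewritten; a field such as the initial instruction pointer is a guest virtual address and is copied unchanged in this sketch.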

When any hypervisor or virtual machine causes a trap, 
the L0 trap handler is called. The trap handler then 
inspects the trapping instruction and its context, and 
decides whether that trap should be handled by L0 (e.g., 
because the trapping context was L1) or whether to 
forward it to the responsible hypervisor (e.g., because the 
trap occurred in L2 and should be handled by L1). In the 
latter case, L0 forwards the trap to L1 for handling. 
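The dispatch decision can be sketched as a small function. This is a simplified model, not the Turtles trap handler: the function name, the string trap reasons, and the `l1_intercepts` set (the trap reasons L1 has asked to handle for its guest) are assumptions made for illustration.

```python
def dispatch_trap(trapping_level, reason, l1_intercepts):
    """Decide, inside L0, which hypervisor handles a trap.

    trapping_level: 1 if L1 was running when the trap occurred, 2 if L2 was.
    l1_intercepts:  assumed set of trap reasons that L1 wants to handle.
    """
    if trapping_level == 1:
        return "L0"          # L1 is a plain guest of L0
    if trapping_level == 2 and reason in l1_intercepts:
        return "L1"          # forward to the responsible guest hypervisor
    return "L0"              # L1 did not ask for this trap; L0 handles it
```

In every case the trap first arrives at L0; the function only captures what L0 does next.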

When there are n levels of nested guests, but the 
hardware supports fewer than n levels of MMU or DMA 
translation tables, the n levels need to be compressed onto the 
levels available in hardware, as described in Sections 3.3 
and 3.4. 


3.2 CPU: Nested VMX Virtualization 


Virtualizing the x86 platform used to be complex and 
slow [40, 41,49]. The hypervisor was forced to re- 
sort to on-the-fly binary translation of privileged instruc- 
tions [3], slow machine emulation [8], or changes to 
guest operating systems at the source code level [6] or 
during compilation [32]. 

In due time Intel and AMD incorporated hardware 
virtualization extensions in their CPUs. These exten- 
sions introduced two new modes of operation: root mode 
and guest mode, enabling the CPU to differentiate be- 
tween running a virtual machine (guest mode) and run- 
ning the hypervisor (root mode). Both Intel and AMD 
also added special in-memory virtual machine control 
structures (VMCS and VMCB, respectively) which con- 
tain environment specifications for virtual machines and 
the hypervisor. 

The VMX instruction set and the VMCS layout are 
explained in detail in [27]. Data stored in the VMCS can be 
divided into three groups. Guest state holds virtualized 
CPU registers (e.g., control registers or segment regis- 
ters) which are automatically loaded by the CPU when 
switching from root mode to guest mode on VMEntry. 
Host state is used by the CPU to restore register val- 
ues when switching back from guest mode to root mode 
on VMExit. Control data is used by the hypervisor to 
inject events such as exceptions or interrupts into vir- 
tual machines and to specify which events should cause 
a VMExit; it is also used by the CPU to specify the 
VMExit reason to the hypervisor. 
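The three groups can be pictured with a toy model. The sketch below is an assumption-laden illustration, not the hardware layout: registers are dictionary entries, the CPU is a dictionary, and the `exit_on` and `exit_reason` keys are hypothetical names for the control data described above.

```python
from dataclasses import dataclass, field

@dataclass
class VMCS:
    guest_state: dict = field(default_factory=dict)  # loaded on VMEntry
    host_state: dict = field(default_factory=dict)   # restored on VMExit
    control: dict = field(default_factory=dict)      # exit conditions, exit reason

def vm_entry(cpu, vmcs):
    """Switch from root mode to guest mode, loading the guest registers."""
    cpu["regs"] = dict(vmcs.guest_state)
    cpu["mode"] = "guest"

def causes_exit(vmcs, event):
    """Control data tells the CPU which events should cause a VMExit."""
    return event in vmcs.control.get("exit_on", set())

def vm_exit(cpu, vmcs, reason):
    """Switch back to root mode and report the reason to the hypervisor."""
    vmcs.guest_state = dict(cpu["regs"])   # guest registers are saved on exit
    vmcs.control["exit_reason"] = reason
    cpu["regs"] = dict(vmcs.host_state)
    cpu["mode"] = "root"
```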

In nested virtualization, the hypervisor running in root 
mode (L0) runs other hypervisors (L1) in guest mode. 
L1 hypervisors have the illusion they are running in root 
mode. Their virtual machines (L2) also run in guest 
mode. 

As can be seen in Figure 3, L0 is responsible for 
multiplexing the hardware between L1 and L2. The CPU 
runs L1 using the VMCS0→1 environment specification. 
Likewise, VMCS0→2 is used to run L2. Both of these 
environment specifications are maintained by L0. In 
addition, L1 creates VMCS1→2 within its own virtualized 
environment. Although VMCS1→2 is never loaded into 
the processor, L0 uses it to emulate a VMX enabled CPU 
for L1. 

Figure 3: Extending VMX for nested virtualization 








3.2.1 VMX Trap and Emulate


VMX instructions can only execute successfully in root
mode. In the nested case, L1 uses VMX instructions in
guest mode to load and launch L2 guests, which causes
VMExits. This enables L0, running in root mode, to trap
and emulate the VMX instructions executed by L1.

In general, when L0 emulates VMX instructions, it
updates VMCS structures according to the update process
described in the next section. Then, L0 resumes
L1, as though the instructions were executed directly by
the CPU. Most of the VMX instructions executed by L1
cause, first, a VMExit from L1 to L0, and then a VMEntry
from L0 to L1.

For the instructions used to run a new VM,
vmresume and vmlaunch, the process is different,
since L0 needs to emulate a VMEntry from L1 to L2.
Therefore, any execution of these instructions by L1
causes, first, a VMExit from L1 to L0, and then a VMEntry
from L0 to L2.
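
The two cases above can be summarized in a few lines of Python. This is an illustrative sketch only: the function and constant names are hypothetical, and it models just the order of world switches, not the VMCS updates L0 performs in between.

```python
# Hypothetical model of the control flow described above; not KVM code.
ENTRY_INSTRUCTIONS = {"vmlaunch", "vmresume"}


def emulate_l1_vmx_instruction(instr):
    """Return the world switches performed when L1 executes a VMX instruction."""
    # L1 runs in guest mode, so the VMX instruction traps to L0.
    transitions = ["VMExit L1->L0"]
    if instr in ENTRY_INSTRUCTIONS:
        # L0 emulates the VMEntry that L1 requested and enters L2.
        transitions.append("VMEntry L0->L2")
    else:
        # L0 emulates the instruction and resumes L1.
        transitions.append("VMEntry L0->L1")
    return transitions


print(emulate_l1_vmx_instruction("vmwrite"))
print(emulate_l1_vmx_instruction("vmresume"))
```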


3.2.2 VMCS Shadowing 


L0 prepares a VMCS (VMCS0→1) to run L1, exactly in
the same way a hypervisor executes a guest with a single
level of virtualization. From the hardware’s perspective,
the processor is running a single hypervisor (L0) in root
mode and a guest (L1) in guest mode. L1 is not aware
that it is running in guest mode and uses VMX instructions
to create the specifications for its own guest, L2.
L1 defines L2’s environment by creating a VMCS
(VMCS1→2) which contains L2’s environment from L1’s
perspective. For example, the VMCS1→2 GUEST-CR3
field points to the page tables that L1 prepared for L2.
L0 cannot use VMCS1→2 to execute L2 directly, since
VMCS1→2 is not valid in L0’s environment and L0 cannot
use L1’s page tables to run L2. Instead, L0 uses
VMCS1→2 to construct a new VMCS (VMCS0→2) that
holds L2’s environment from L0’s perspective.

L0 must consider all the specifications defined
in VMCS1→2 and also the specifications defined in
VMCS0→1 to create VMCS0→2. The host state defined in
VMCS0→2 must contain the values required by the CPU
to correctly switch back from L2 to L0. In addition,
VMCS1→2 host state must be copied to VMCS0→1 guest
state. Thus, when L0 emulates a switch from L2 to
L1, the processor loads the correct L1 specifications.

The guest state stored in VMCS1→2 does not require
any special handling in general, and most fields can be
copied directly to the guest state of VMCS0→2.

The control data of VMCS1→2 and VMCS0→1 must be
merged to correctly emulate the processor behavior. For
example, consider the case where L1 specifies to trap an
event E_A in VMCS1→2 but L0 does not trap such an event
for L1 (i.e., a trap is not specified in VMCS0→1). To
forward the event E_A to L1, L0 needs to specify the
corresponding trap in VMCS0→2. In addition, the field used by
L1 to inject events to L2 needs to be merged, as well as
the fields used by the processor to specify the exit cause.

For the sake of brevity, we omit some details on how 
specific VMCS fields are merged. For the complete de- 
tails, the interested reader is encouraged to refer to the 
KVM source code [29]. 
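
As a rough illustration of the construction described in this section, the following Python sketch models each VMCS as a dictionary with a guest area, a host area, and a set of trapped events. The dictionary layout, the function names, and the use of a plain set union for the control data are simplifying assumptions made here; the actual per-field merging rules are those in the KVM source.

```python
# Toy model: a VMCS is a dict with "guest", "host", and "traps" entries.
# All names and the set-union merge are assumptions for illustration.

def build_vmcs02(vmcs12, vmcs01, l0_host_state):
    """Construct VMCS0->2 from VMCS1->2 and VMCS0->1 (simplified)."""
    return {
        # Most guest-state fields are copied directly from VMCS1->2.
        "guest": dict(vmcs12["guest"]),
        # Host state must let the CPU switch back from L2 to L0.
        "host": dict(l0_host_state),
        # Trap anything that either L1 (for L2) or L0 (for L1) asked for.
        "traps": set(vmcs12["traps"]) | set(vmcs01["traps"]),
    }


def emulate_switch_l2_to_l1(vmcs12, vmcs01):
    """Copy VMCS1->2 host state into VMCS0->1 guest state, so that
    resuming L1 loads the state L1 expects after an exit from L2."""
    vmcs01["guest"] = dict(vmcs12["host"])


vmcs01 = {"guest": {"rip": 0x1000}, "host": {"rip": 0xF000}, "traps": {"hlt"}}
vmcs12 = {"guest": {"rip": 0x2000}, "host": {"rip": 0x1234}, "traps": {"E_A"}}
vmcs02 = build_vmcs02(vmcs12, vmcs01, l0_host_state=vmcs01["host"])
print(sorted(vmcs02["traps"]))
```

In this model the trap on E_A appears in VMCS0→2 even though VMCS0→1 does not specify it, matching the example in the text.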


3.2.3 VMEntry and VMExit Emulation 


In nested environments, switches from L1 to L2 and back
must be emulated. When L2 is running and a VMExit
occurs there are two possible handling paths, depending
on whether the VMExit must be handled only by L0 or
must be forwarded to L1.

When the event causing the VMExit is related to L0
only, L0 handles the event and resumes L2. This kind of
event can be an external interrupt, a non-maskable interrupt
(NMI) or any trappable event specified in VMCS0→2
that was not specified in VMCS1→2. From L1’s perspective
this event does not exist because it was generated
outside the scope of L1’s virtualized environment. By
analogy to the non-nested scenario, an event occurred at
the hardware level, the CPU transparently handled it, and
the hypervisor continued running as before.

The second handling path is caused by events related
to L1 (e.g., trappable events specified in VMCS1→2).
In this case L0 forwards the event to L1 by copying
VMCS0→2 fields updated by the processor to VMCS1→2
and resuming L1. The hypervisor running in L1 believes
there was a VMExit directly from L2 to L1. The L1
hypervisor handles the event and later on resumes L2 by
executing vmresume or vmlaunch, both of which will
be emulated by L0.
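
The choice between the two handling paths can be sketched as follows. This is a simplified model under stated assumptions: the event names, the dictionary-based VMCS, and the rule that external interrupts and NMIs always stay in L0 follow the description above, not any particular hypervisor's code.

```python
# Hypothetical sketch of the two L2 exit-handling paths.
L0_ONLY_EVENTS = {"external_interrupt", "nmi"}


def handle_l2_exit(event, exit_fields, vmcs02, vmcs12):
    """Decide who handles an exit from L2 and return what L0 resumes."""
    if event in L0_ONLY_EVENTS or event not in vmcs12["traps"]:
        # First path: the event is invisible to L1; L0 handles it alone.
        return "resume L2"
    # Second path: copy the fields the processor updated in VMCS0->2
    # into VMCS1->2, so L1 sees a VMExit that looks like it came from L2.
    for name in exit_fields:
        vmcs12[name] = vmcs02[name]
    return "resume L1"


vmcs02 = {"exit_reason": "cpuid", "traps": {"cpuid", "hlt"}}
vmcs12 = {"exit_reason": None, "traps": {"cpuid"}}
print(handle_l2_exit("cpuid", ["exit_reason"], vmcs02, vmcs12))
print(handle_l2_exit("hlt", ["exit_reason"], vmcs02, vmcs12))
```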


3.3 MMU: Multi-dimensional Paging


In addition to virtualizing the CPU, a hypervisor also 
needs to virtualize the MMU: A guest OS builds a guest 
page table which translates guest virtual addresses to 
guest physical addresses. These must be translated again 
into host physical addresses. With nested virtualization, 
a third layer of address translation is needed. 

These translations can be done entirely in software, 
or assisted by hardware. However, as we explain be- 
low, current hardware supports only one or two dimen- 
sions (levels) of translation, not the three needed for 
nested virtualization. In this section we present a new 
technique, multi-dimensional paging, for multiplexing 
the three needed translation tables onto the two avail- 
able in hardware. In Section 4.1.2 we demonstrate the 
importance of this technique, showing that more naive 
approaches (surveyed below) cause at least a three-fold 
slowdown of some useful workloads. 

When no hardware support for memory manage- 
ment virtualization was available, a technique known as 
shadow page tables [15] was used. A guest creates a 
guest page table, which translates guest virtual addresses 
to guest physical addresses. Based on this table, the hy- 
pervisor creates a new page table, the shadow page ta- 
ble, which translates guest virtual addresses directly to 
the corresponding host physical address [3,6]. The hy- 
pervisor then runs the guest using this shadow page table 
instead of the guest’s page table. The hypervisor has to 
trap all guest paging changes, including page fault excep- 
tions, the INVLPG instruction, context switches (which 
cause the use of a different page table) and all the guest 
updates to the page table. 

To improve virtualization performance, x86 architec- 
tures recently added two-dimensional page tables [13]— 
a second translation table in the hardware MMU. When 
translating a guest virtual address, the processor first uses 
the regular guest page table to translate it to a guest phys- 
ical address. It then uses the second table, called EPT by 
Intel (and NPT by AMD), to translate the guest physi- 
cal address to a host physical address. When an entry 
is missing in the EPT table, the processor generates an 
EPT violation exception. The hypervisor is responsible 
for maintaining the EPT table and its cache (which can 
be flushed with INVEPT), and for handling EPT viola- 
tions, while guest page faults can be handled entirely by 
the guest. 
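
To make the division of labor concrete, here is a minimal Python sketch of a two-dimensional translation. It assumes flat, single-level tables represented as dictionaries from page number to page number, and the exception names are invented for this example. Real hardware walks multi-level tables and also translates the guest page table's own memory through the second table, which this sketch omits.

```python
class GuestPageFault(Exception):
    """Missing entry in the guest page table; handled by the guest."""


class EptViolation(Exception):
    """Missing entry in the second table; handled by the hypervisor."""


def translate(gva_page, guest_pt, ept):
    """Translate a guest virtual page to a host physical page."""
    if gva_page not in guest_pt:
        raise GuestPageFault(gva_page)
    gpa_page = guest_pt[gva_page]      # first dimension: guest page table
    if gpa_page not in ept:
        raise EptViolation(gpa_page)
    return ept[gpa_page]               # second dimension: EPT (or NPT)


guest_pt = {0x10: 0x2}   # guest virtual page -> guest physical page
ept = {0x2: 0x7F}        # guest physical page -> host physical page
print(hex(translate(0x10, guest_pt, ept)))
```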

The hypervisor, depending on the processor’s capabilities,
decides whether to use shadow page tables or
two-dimensional page tables to virtualize the MMU. In nested
environments, both hypervisors, L0 and L1, determine
independently the preferred mechanism. Thus, L0 and
L1 hypervisors can use the same or a different MMU
virtualization mechanism. Figure 4 shows three different
nested MMU virtualization models.

Figure 4: MMU alternatives for nested virtualization


Shadow-on-shadow is used when the processor does
not support two-dimensional page tables, and is the least
efficient method. Initially, L0 creates a shadow page table
to run L1 (SPT0→1). L1, in turn, creates a shadow
page table to run L2 (SPT1→2). L0 cannot use SPT1→2
to run L2 because this table translates L2 guest virtual
addresses to L1 host physical addresses. Therefore, L0
compresses SPT0→1 and SPT1→2 into a single shadow
page table, SPT0→2. This new table translates directly
from L2 guest virtual addresses to L0 host physical
addresses. Specifically, for each guest virtual address in
SPT1→2, L0 creates an entry in SPT0→2 with the
corresponding L0 host physical address.

Shadow-on-EPT is the most straightforward approach
to use when the processor supports EPT. L0 uses the EPT
hardware, but L1 cannot use it, so it resorts to shadow
page tables. L1 uses SPT1→2 to run L2. L0 configures
the MMU to use SPT1→2 as the first translation table and
EPT0→1 as the second translation table. In this way, the
processor first translates from the L2 guest virtual address to
the L1 host physical address using SPT1→2, and then
translates from the L1 host physical address to the L0 host
physical address using EPT0→1.

Though the Shadow-on-EPT approach uses the EPT
hardware, it still has a noticeable overhead due to page
faults and page table modifications in L2. These must
be handled in L1, to maintain the shadow page table.
Each of these faults and writes causes a VMExit
and must be forwarded from L0 to L1 for handling. In
other words, Shadow-on-EPT is slow for exactly the
same reasons that Shadow itself was slow for single-level
virtualization, but it is even slower because nested exits
are slower than non-nested exits.

In multi-dimensional page tables, as in two-dimensional
page tables, each level creates its own separate
translation table. For L1 to create an EPT table, L0
exposes EPT capabilities to L1, even though the hardware
only provides a single EPT table.

Since only one EPT table is available in hardware, the
two EPT tables should be compressed into one: Let us
assume that L0 runs L1 using EPT0→1, and that L1 creates
an additional table, EPT1→2, to run L2, because L0
exposed a virtualized EPT capability to L1. The L0
hypervisor could then compress EPT0→1 and EPT1→2 into
a single EPT0→2 table as shown in Figure 4. Then L0
could run L2 using EPT0→2, which translates directly
from the L2 guest physical address to the L0 host physical
address, reducing the number of page fault exits and
improving nested virtualization performance. In
Section 4.1.2 we demonstrate more than a three-fold speedup
of some useful workloads with multi-dimensional page
tables, compared to shadow-on-EPT.

The L0 hypervisor launches L2 with an empty EPT0→2
table, building the table on-the-fly, on L2 EPT-violation
exits. These happen when a translation for a guest physical
address is missing in the EPT table. If there is no
translation in EPT1→2 for the faulting address, L0 first
lets L1 handle the exit and update EPT1→2. L0 can now
create an entry in EPT0→2 that translates the L2 guest
physical address directly to the L0 host physical address:
EPT1→2 is used to translate the L2 physical address to an
L1 physical address, and EPT0→1 translates that into the
desired L0 physical address.

To maintain correctness of EPT0→2, the L0 hypervisor
needs to know of any changes that L1 makes to EPT1→2.
L0 sets the memory area of EPT1→2 as read-only, thereby
causing a trap when L1 tries to update it. L0 will then
update EPT0→2 according to the changed entries in EPT1→2.
L0 also needs to trap all L1 INVEPT instructions, and
invalidate the EPT cache accordingly.
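
The lazy construction of EPT0→2 can be sketched in Python as follows. This is a toy model, not the KVM implementation: tables are flat dictionaries keyed by page number, the class and method names are hypothetical, EPT0→1 is assumed to already map every L1 page, and a trapped write is handled by simply dropping the stale entry so that it is rebuilt on the next violation.

```python
class L0MultiDimPaging:
    """Toy model of L0 maintaining EPT0->2 from EPT0->1 and EPT1->2."""

    def __init__(self, ept01):
        self.ept01 = ept01   # L1 physical page -> L0 physical page
        self.ept02 = {}      # L2 physical page -> L0 physical page, starts empty

    def on_l2_ept_violation(self, l2_page, ept12):
        """Handle an EPT-violation exit taken while L2 was running."""
        if l2_page not in ept12:
            # L1 has no translation yet: let L1 handle the exit first.
            return "forward to L1"
        l1_page = ept12[l2_page]                      # L2 physical -> L1 physical
        self.ept02[l2_page] = self.ept01[l1_page]     # L1 physical -> L0 physical
        return "resume L2"

    def on_l1_write_to_ept12(self, l2_page):
        """Called when L1's write to the read-only EPT1->2 area traps."""
        self.ept02.pop(l2_page, None)


l0 = L0MultiDimPaging(ept01={0x20: 0x900})
ept12 = {}
print(l0.on_l2_ept_violation(0x5, ept12))   # L1 must fill EPT1->2 first
ept12[0x5] = 0x20                           # L1 handles the exit
print(l0.on_l2_ept_violation(0x5, ept12))   # now L0 can fill EPT0->2
print(l0.ept02)
```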

By using huge pages [34] to back guest memory, L0
can create smaller and faster EPT tables. Finally, to
further improve performance, L0 also allows L1 to use
VPIDs. With this feature, the CPU tags each translation
in the TLB with a numeric virtual-processor id,
eliminating the need for TLB flushes on every VMEntry
and VMExit. Since each hypervisor is free to choose
these VPIDs arbitrarily, they might collide and therefore
L0 needs to map the VPIDs that L1 uses into valid L0
VPIDs.
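
One possible shape for such a mapping is sketched below. The class name and allocation policy are assumptions made for illustration; the sketch further assumes 16-bit VPIDs with the value 0 unavailable to guests, and it never frees a VPID once allocated.

```python
class VpidMapper:
    """Hypothetical L0-side table mapping L1-chosen VPIDs to free L0 VPIDs."""

    def __init__(self, in_use_by_l0):
        # VPIDs L0 already uses for its own guests, plus 0 (assumed reserved).
        self.taken = set(in_use_by_l0) | {0}
        self.l1_to_l0 = {}

    def translate(self, l1_vpid):
        """Return the L0 VPID backing the VPID that L1 chose for an L2 vCPU."""
        if l1_vpid not in self.l1_to_l0:
            # Pick the lowest VPID that does not collide with one in use.
            candidate = next(v for v in range(1, 1 << 16) if v not in self.taken)
            self.taken.add(candidate)
            self.l1_to_l0[l1_vpid] = candidate
        return self.l1_to_l0[l1_vpid]


mapper = VpidMapper(in_use_by_l0={1, 2})
print(mapper.translate(1))   # L1's VPID 1 collides with L0's; it is remapped
print(mapper.translate(1))   # the mapping is stable across calls
```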


3.4 I/O: Multi-level Device Assignment 


I/O is the third major challenge in server virtualization. 
There are three approaches commonly used to provide 
I/O services to a guest virtual machine. Either the hyper- 
visor emulates a known device and the guest uses an un- 
modified driver to interact with it [47], or a para-virtual 
driver is installed in the guest [6,42], or the host assigns 
a real device to the guest which then controls the device 
directly [11,31,37,52,53]. Device assignment generally 
provides the best performance [33, 38,53], since it mini- 
mizes the number of I/O-related world switches between 
the virtual machine and its hypervisor, and although it 
complicates live migration, device assignment and live
migration can peacefully coexist [26, 28, 54].

These three basic I/O approaches for a single-level
guest imply nine possible combinations in the two-level
nested guest case. Of the nine potential combinations
we evaluated the more interesting cases, presented in
Table 1. Implementing the first four alternatives is
straightforward. We describe the last option, which we call
multi-level device assignment, below. Multi-level device
assignment lets the L2 guest access a device directly,
bypassing both hypervisors. This direct device
access requires dealing with DMA, interrupts, MMIO,
and PIOs [53].

Table 1: I/O combinations for a nested guest


Device DMA in virtualized environments is compli- 
cated, because guest drivers use guest physical addresses, 
while memory access in the device is done with host 
physical addresses. The common solution to the DMA 
problem is an IOMMU [2, 11], a hardware component 
which resides between the device and main memory. It 
uses a translation table prepared by the hypervisor to 
translate the guest physical addresses to host physical 
addresses. IOMMUs currently available, however, only 
support a single level of address translation. Again, we 
need to compress two levels of translation tables onto the 
one level available in hardware. 

For modified guests this can be done using a paravirtual
IOMMU: the code in L1 which sets a mapping on
the IOMMU from L2 to L1 addresses is replaced by a
hypercall to L0. L0 changes the L1 address in that mapping
to the respective L0 address, and puts the resulting
mapping (from L2 to L0 addresses) in the IOMMU.

A better approach, one which can run unmodified
guests, is for L0 to emulate an IOMMU for L1 [5]. L1
believes that it is running on a machine with an IOMMU,
and sets up mappings from L2 to L1 addresses on it. L0
intercepts these mappings, remaps the L1 addresses to
L0 addresses, and builds the L2-to-L0 map on the real
IOMMU.
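
A minimal sketch of this interception, assuming page-granular mappings held in dictionaries and a hypothetical EmulatedIommu class on the L0 side, might look like this. It models only the address remapping, not the emulated device interface that L1 actually programs.

```python
class EmulatedIommu:
    """Toy model of the IOMMU that L0 emulates for L1."""

    def __init__(self, l1_to_l0):
        self.l1_to_l0 = l1_to_l0   # L1 physical page -> L0 physical page
        self.real_iommu = {}       # what the hardware IOMMU is programmed with

    def map(self, l2_page, l1_page):
        """L1 maps an L2 address to an L1 address; L0 intercepts the request
        and installs the corresponding L2-to-L0 mapping in the real IOMMU."""
        self.real_iommu[l2_page] = self.l1_to_l0[l1_page]


iommu = EmulatedIommu(l1_to_l0={0x30: 0xA00})
iommu.map(l2_page=0x4, l1_page=0x30)
print(iommu.real_iommu)
```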

In current x86 architecture, interrupts always cause a
guest exit to L0, which proceeds to forward the interrupt
to L1. L1 will then inject it into L2. The EOI (end of
interrupt) will also cause a guest exit. In Section 4.1.1 we
discuss the slowdown caused by these interrupt-related
exits, and propose ways to avoid it.


Memory-mapped I/O (MMIO) and Port I/O (PIO) for 
a nested guest work the same way they work for a single- 
level guest, without incurring exits on the critical I/O 
path [53]. 


3.5 Micro Optimizations 


There are two main places where a guest of a nested
hypervisor is slower than the same guest running on a
bare-metal hypervisor. First, the transitions between L1 and
L2 are slower than the transitions between L0 and L1.
Second, the exit handling code running in the L1
hypervisor is slower than the same code running in L0. In this
section we discuss these two issues, and propose
optimizations that improve performance. Since we assume
that both L1 and L2 are unmodified, these optimizations
require modifying L0 only. We evaluate these
optimizations in the evaluation section.


3.5.1 Optimizing transitions between L1 and L2


As explained in Section 3.2.3, transitions between L1
and L2 involve an exit to L0 and then an entry. In
L0, most of the time is spent merging the VMCSs. We
optimize this merging code to only copy data between
VMCSs if the relevant values were modified. Keeping
track of which values were modified has an intrinsic cost,
so one must carefully balance full copying versus partial
copying and tracking. We observed empirically that for
common workloads and hypervisors, partial copying has
a lower overhead.
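
The trade-off between full and partial copying can be illustrated with a small Python sketch. The TrackedVmcs class and merge_dirty function are hypothetical names, and a set of dirty field names is just one simple way to do the tracking; the bookkeeping done on every write is the intrinsic cost mentioned above.

```python
class TrackedVmcs:
    """VMCS model that remembers which fields changed since the last merge."""

    def __init__(self, fields):
        self.fields = dict(fields)
        self.dirty = set()

    def write(self, name, value):
        # Tracking cost: one comparison and one set insertion per write.
        if self.fields.get(name) != value:
            self.fields[name] = value
            self.dirty.add(name)


def merge_dirty(src, dst):
    """Copy only the modified fields from src into dst; return how many."""
    copied = 0
    for name in src.dirty:
        dst[name] = src.fields[name]
        copied += 1
    src.dirty.clear()
    return copied


vmcs12 = TrackedVmcs({"rip": 0x1000, "rsp": 0x8000, "cr3": 0x3000})
vmcs02_guest = dict(vmcs12.fields)
vmcs12.write("rip", 0x1004)            # only one field changes
print(merge_dirty(vmcs12, vmcs02_guest), "of", len(vmcs12.fields), "fields copied")
```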

VMCS merging could be further optimized by copying
multiple VMCS fields at once. However, according to
Intel’s specifications, reads or writes to the VMCS area 
must be performed using vmread and vmwrite in- 
structions, which operate on a single field. We empiri- 
cally noted that under certain conditions one could ac- 
cess VMCS data directly without ill side-effects, bypass- 
ing vmread and vmwrite and copying multiple fields 
at once with large memory copies. However, this opti- 
mization does not strictly adhere to the VMX specifica- 
tions, and thus might not work on processors other than 
the ones we have tested. 

In the evaluation section, we show that this opti- 
mization gives a significant performance boost in micro- 
benchmarks. However, it did not noticeably improve the 
other, more typical, workloads that we have evaluated. 


3.5.2 Optimizing exit handling in L1


The exit-handling code in the hypervisor is slower when
run in L1 than the same code running in L0. The main
cause of this slowdown is additional exits caused by
privileged instructions in the exit-handling code.

In Intel VMX, the privileged instructions vmread and
vmwrite are used by the hypervisor to read and modify
the guest and host specification. As can be seen in
Section 4.3, these cause L1 to exit multiple times while
it handles a single L2 exit.


In contrast, in AMD SVM, guest and host specifications
can be read or written to directly using ordinary
memory loads and stores. The clear advantage of that
model is that L0 does not intervene while L1 modifies
L2 specifications. Removing the need to trap and emulate
special instructions reduces the number of exits and
improves nested virtualization performance.


One thing L0 can do to avoid trapping on every
vmread and vmwrite is binary translation [3] of problematic
vmread and vmwrite instructions in the L1
instruction stream, by trapping the first time such an
instruction is called and then rewriting it to branch to a
non-trapping memory load or store. To evaluate the
potential performance benefit of this approach, we tested a
modified L1 that directly reads and writes VMCS1→2 in
memory, instead of using vmread and vmwrite. The
performance of this setup, which we call DRW (direct
read and write) is described in the evaluation section.


4 Evaluation 


We start the evaluation and analysis of nested virtualization
with macro benchmarks that represent real-life
workloads. Next, we evaluate the contribution of multi-level
device assignment and multi-dimensional paging to
nested virtualization performance. Most of our experiments
are executed with KVM as the L1 guest hypervisor.
In Section 4.2 we present results with VMware
Server as the L1 guest hypervisor.


We then continue the evaluation with a synthetic,
worst-case micro benchmark running on L2 which
causes guest exits in a loop. We use this synthetic,
worst-case benchmark to understand and analyze the overhead
and the handling flow of a single L2 exit.


Our setup consisted of an IBM x3650 machine booted
with a single Intel Xeon 2.9GHz core and with 3GB of
memory. The host OS was Ubuntu 9.04 with a kernel
that is based on the KVM git tree version kvm-87, with
our nested virtualization support added. For both L1 and
L2 guests we used an Ubuntu Jaunty guest with a kernel
that is based on the KVM git tree, version kvm-87. L1
was configured with 2GB of memory and L2 was configured
with 1GB of memory. For the I/O experiments we
used a Broadcom NetXtreme 1Gb/s NIC connected via
crossover-cable to an e1000e NIC on another machine.


4.1 Macro Workloads


kernbench is a general purpose compilation-type 
benchmark that compiles the Linux kernel multiple 
times. The compilation process is, by nature, CPU- and 
memory-intensive, and it also generates disk I/O to load 
the compiled files into the guest’s page cache. 

SPECjbb is an industry-standard benchmark designed
to measure the server-side performance of Java
run-time environments. It emulates a three-tier system 
and is primarily CPU-intensive. 

We executed kernbench and SPECjbb in four setups:
host, single-level guest, nested guest, and nested
guest optimized with direct read and write (DRW) as
described in Section 3.5.2. The optimizations described
in Section 3.5.1 did not make a significant difference to
these benchmarks, and are thus omitted from the results.
We used KVM as both L0 and L1 hypervisor with
multi-dimensional paging. The results are depicted in Table 2.



























































Kernbench

                     | Host  | Guest | Nested | Nested DRW
Run time             | 324.3 | 355   | 406.3  | 391.5
STD dev.             | 1.5   | 10    | 6.7    | 3.1
% overhead vs. host  | -     | 9.5   | 25.3   | 20.7
% overhead vs. guest | -     | -     | 14.5   | 10.3
%CPU                 | 93    | 97    | 99     | 99

SPECjbb

                        | Host  | Guest | Nested | Nested DRW
Score                   | 90493 | 83599 | 77065  | 78347
STD dev.                | 1104  | 1230  | 1716   | 566
% degradation vs. host  | -     | 7.6   | 14.8   | 13.4
% degradation vs. guest | -     | -     | 7.8    | 6.3
%CPU                    | 100   | 100   | 100    | 100

Table 2: kernbench and SPECjbb results





We compared the impact of running the workloads in a
nested guest with running the same workload in a single-level
guest, i.e., the overhead added by the additional
level of virtualization. For kernbench, the overhead
of nested virtualization is 14.5%, while for SPECjbb the
score is degraded by 7.82%. When we discount the
Intel-specific vmread and vmwrite overhead in L1,
the overhead is 10.3% and 6.3% respectively.
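
These percentages follow directly from the run times and scores reported in Table 2. The short Python check below recomputes them; the helper names are ours, and the only inputs are the guest, nested, and nested DRW values from the table.

```python
def overhead_pct(baseline_time, measured_time):
    """Extra run time relative to the baseline, in percent."""
    return 100.0 * (measured_time - baseline_time) / baseline_time


def degradation_pct(baseline_score, measured_score):
    """Score lost relative to the baseline, in percent."""
    return 100.0 * (baseline_score - measured_score) / baseline_score


# kernbench run times: guest 355, nested 406.3, nested DRW 391.5
print(round(overhead_pct(355, 406.3), 1))        # nested vs. guest
print(round(overhead_pct(355, 391.5), 1))        # nested DRW vs. guest

# SPECjbb scores: guest 83599, nested 77065, nested DRW 78347
print(round(degradation_pct(83599, 77065), 2))   # nested vs. guest
print(round(degradation_pct(83599, 78347), 1))   # nested DRW vs. guest
```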

To analyze the sources of overhead, we examine the
time distribution between the different levels. Figure 5
shows the time spent in each level. It is interesting to
compare the time spent in the hypervisor in the single-level
case with the time spent in L1 in the nested guest
case, since both hypervisors are expected to do the same
work. The times are indeed similar, although the L1
hypervisor takes more cycles due to cache pollution and
TLB flushes, as we show in Section 4.3. The significant
part of the virtualization overhead in the nested case
comes from the time spent in L0 and the increased
number of exits.


For SPECjbb, the total number of cycles across
all levels is the same for all setups. This is because
SPECjbb executed for the same pre-set amount of time
in both cases and the difference was in the benchmark
score.





Efficiently virtualizing a hypervisor is hard. Nested
virtualization creates a new kind of workload for the L0
hypervisor which did not exist before: running another
hypervisor (L1) as a guest. As can be seen in Figure 5,
for kernbench L0 takes only 2.28% of the overall cycles
in the single-level guest case, but takes 5.17% of the
overall cycles for the nested-guest case. In other words,
L0 has to work more than twice as hard when running a
nested guest.


Not all exits of L2 incur the same overhead, as each
type of exit requires different handling in L0 and L1. In
Figure 6, we show the total number of cycles required
to handle each exit type. For the single level guest we
measured the number of cycles between VMExit and the
consequent VMEntry. For the nested guest we measured
the number of cycles spent between L2 VMExit and the
consequent L2 VMEntry.


There is a large variance between the handling times
of different types of exits. The cost of each exit comes
primarily from the number of privileged instructions
performed by L1, each of which causes an exit to L0. For
example, when L1 handles a PIO exit of L2, it generates on
average 31 additional exits, whereas in the cpuid case
discussed later in Section 4.3 only 13 exits are required.
Discounting traps due to vmread and vmwrite, the
average number of exits was reduced to 14 for PIO and
to 2 for cpuid.


Another source of overhead is heavy-weight exits. The
external interrupt exit handler takes approximately 64K
cycles when executed by L0. The PIO exit handler takes
approximately 12K cycles when executed by L0. However,
when those handlers are executed by L1, they take
much longer: approximately 192K cycles and 183K cycles,
respectively. Discounting traps due to vmread
and vmwrite, they take approximately 148K cycles and
130K cycles, respectively. This difference in execution
times between L0 and L1 is due to two reasons: first, the
handlers execute privileged instructions causing exits to
L0. Second, the handlers run for a long time compared
with other handlers and therefore more external events
such as external interrupts occur during their run-time.

Figure 6: Cycle costs of handling different types of exits


4.1.1 I/O Intensive Workloads 


To examine the performance of a nested guest in the 
case of I/O intensive workloads we used netperf, a 
TCP streaming application that attempts to maximize the 
amount of data sent over a single TCP connection. We 
measured the performance on the sender side, with the 
default settings of netperf (16,384 byte messages). 

Figure 7 shows the results for running the netperf 
TCP stream test on the host, in a single-level guest, and in 
a nested guest, using the five I/O virtualization combina- 
tions described in Section 3.4. We used KVM’s default 
emulated NIC (RTL-8139), virtio [42] for a paravirtual 
NIC, and a 1 Gb/s Broadcom NetXtreme II with device
assignment. All tests used a single CPU core. 

On bare-metal, netperf easily achieved line rate 
(940 Mb/s) with 20% CPU utilization. 

Emulation gives a much lower throughput, with full
CPU utilization: on a single-level guest we get 25%
of the line rate. On the nested guest the throughput is
even lower and the overhead is dominated by the cost of
device emulation between L1 and L2. Each L2 exit is
trapped by L0 and forwarded to L1. For each L2 exit, L1
then executes multiple privileged instructions, incurring
multiple exits back to L0. In this way the overhead for
each L2 exit is multiplied.

Figure 7: Performance of netperf in various setups
(axes: throughput in Mbps, % cpu)

The para-virtual virtio NIC performs better than emulation
since it reduces the number of exits. Using virtio
all the way up to L2 gives 75% of line rate with a saturated
CPU, better but still considerably below bare-metal
performance.

Multi-level device assignment achieved the best performance,
with line rate at 60% CPU utilization
(Figure 7, direct/direct). Using device assignment between
L0 and L1 and virtio between L1 and L2 enables the L2
guest to saturate the 1Gb link with 92% CPU utilization
(Figure 7, direct/virtio).

While multi-level device assignment outperformed the
other methods, its measured performance is still suboptimal
because 60% of the CPU is used for running a
workload that only takes 20% on bare-metal. Unfortunately,
on current x86 architecture, interrupts cannot be
assigned to guests, so both the interrupt itself and its EOI
cause exits. The more interrupts the device generates,
the more exits, and therefore the higher the virtualization
overhead, which is more pronounced in the nested
case. We hypothesize that these interrupt-related exits
are the biggest source of the remaining overhead, so had
the architecture given us a way to avoid these exits (by
assigning interrupts directly to guests rather than having
each interrupt go through both hypervisors), netperf
performance on L2 would be close to that of bare-metal.

To test this hypothesis we reduced the number of interrupts,
by modifying the standard bnx2 network driver to
work without any interrupts, i.e., continuously poll the
device for pending events.

Figure 8 compares some of the I/O virtualization combinations
with this polling driver. Again, multi-level device
assignment is the best option and, as we hypothesized,
this time L2 performance is close to bare-metal.
With netperf’s default 16,384 byte messages, the
throughput is often capped by the 1 Gb/s line rate, so we
ran netperf with smaller messages. As we can see in the
figure, for 64-byte messages, for example, on L0 (bare
metal) a throughput of 900 Mb/s is achieved, while on
L2 with multi-level device assignment, we get 837 Mb/s,
a mere 7% slowdown. The runner-up method, virtio on
direct, was not nearly as successful, and achieved just
469 Mb/s, 50% below bare-metal performance. CPU
utilization was 100% in all cases since a polling driver
consumes all available CPU cycles.

Figure 8: Performance of netperf with interrupt-less
network driver


4.1.2 Impact of Multi-dimensional Paging 


To evaluate multi-dimensional paging, we compared
each of the macro benchmarks described in the previous
sections with and without multi-dimensional paging.
For each benchmark we configured L0 to run L1 with
EPT support. We then compared the case where L1 uses
shadow page tables to run L2 (“Shadow-on-EPT”) with
the case of L1 using EPT to run L2 (“multi-dimensional
paging”).

Figure 9: Impact of multi-dimensional paging


Figure 9 shows the results. The overhead between the
two cases is mostly due to the number of page-fault exits.
When shadow paging is used, each page fault of the L2
guest results in a VMExit. When multi-dimensional paging
is used, only an access to a guest physical page that is
not mapped in the EPT table will cause an EPT violation
exit. Therefore the impact of multi-dimensional paging
depends on the number of guest page faults, which is a
property of the workload. The improvement is startling
in benchmarks such as kernbench with a high number
of page faults, and is less pronounced in workloads that
do not incur many page faults.


4.2 VMware Server as a Guest Hypervisor 


We also evaluated VMware as the L1 hypervisor to analyze
how a different guest hypervisor affects nested
virtualization performance. We used the hosted version,
VMware Server v2.0.1, build 156745 x86-64, on top of
Ubuntu based on kernel 2.6.28-11. We intentionally did
not install VMware tools for the L2 guest, thereby
increasing nested virtualization overhead. Due to similar
results obtained for VMware and KVM as the nested
hypervisor, we show only kernbench and SPECjbb results
below.








overhead of handling guest exits in L0 and L1. Based on
this definition, this cpuid micro benchmark is a worst
case workload, since L2 does virtually nothing except
generate exits. We note that cpuid cannot in the general
case be handled by L0 directly, as L1 may wish to
modify the values returned to L2.

Figure 10 shows the number of CPU cycles required to 
execute a single cpuid instruction. We ran the cpuid 
instruction 4* 10° times and calculated the average num- 
ber of cycles per iteration. We repeated the test for the 
following setups: 1. native, 2. running cpuidina single 
level guest, and 3. running cpuid ina nested guest with 
and without the optimizations described in Section 3.5. 
For each execution, we present the distribution of the cy- 
cles between the levels: Lo, Li, Lz. CPU mode switch 
stands for the number of cycles spent by the CPU when 
performing a VMEntry or a VMExit. On bare metal 
cpuid takes about 100 cycles, while in a virtual ma- 
chine it takes about 2,600 cycles (Figure 10, column 1), 
about 1,000 of which is due to the CPU mode switch- 
ing. When run in a nested virtual machine it takes about 
58,000 cycles (Figure 10, column 2). 





Benchmark | % overhead vs. single-level guest 
kernbench 14.98 
SPEC jbb 8.85 
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Table 3: VMware Server as a guest hypervisor 


Examining L, exits, we noticed VMware Server 
uses VMX initialization instructions (vmon, vmoff, 
vmptrid, vmclear) several times during L2 execu- 
tion. Conversely, KVM uses them only once. This 
dissimilitude derives mainly from the approach used by 
VMware to interact with the host Linux kernel. Each 
time the monitor module takes control of the CPU, it en- 
ables VMX. Then, before it releases control to the Linux 
kernel, VMX is disabled. Furthermore, during this tran- 
sition many non-VMxX privileged instructions are exe- 
cuted by Lj, increasing Lo intervention. 

Although all these initialization instructions are emu- 
lated by Lo, transitions from the VMware monitor mod- 
ule to the Linux kernel are less frequent for Kernbench 
and SPECjbb. The VMware monitor module typically 
handles multiple Lz exits before switching to the Linux 
kernel. As a result, this behavior only slightly affected 
the nested virtualization performance. 


4.3 Micro Benchmark Analysis 


To analyze the cycle-costs of handling a single L2 exit, 
we ran a micro benchmark in L2 that does nothing ex- 
cept generate exits by calling cpuid ina loop. The vir- 
tualization overhead for running an Lg guest is the ratio 
between the effective work done by the Lz guest and the 
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Figure 10: CPU cycle distribution for cpuid 


To understand the cost of handling a nested guest 
exit compared to the cost of handling the same exit for 
a single-level guest, we analyzed the flow of handling 
cpuid: 


1. L2 executes a cpuid instruction
2. CPU traps and switches to root mode L0
3. L0 switches state from running L2 to running L1
4. CPU switches to guest mode L1
5. L1 modifies VMCS1→2; repeat n times:
   (a) L1 accesses VMCS1→2
   (b) CPU traps and switches to root mode L0
   (c) L0 emulates VMCS1→2 access and resumes L1
   (d) CPU switches to guest mode L1
6. L1 emulates cpuid for L2
7. L1 executes a resume of L2
8. CPU traps and switches to root mode L0
9. L0 switches state from running L1 to running L2
10. CPU switches to guest mode L2


In general, step 5 can be repeated multiple times. Each
iteration consists of a single VMExit from L1 to L0.
The total number of exits depends on the specific implementation
of the L1 hypervisor. A nesting-friendly
hypervisor will keep privileged instructions to a minimum.
In any case, the L1 hypervisor must interact with
VMCS1→2, as described in Section 3.2.2. In the case of
cpuid, in step 5, L1 reads 7 fields of VMCS1→2 and
writes 4 fields to VMCS1→2, which ends up as 11 VMExits
from L1 to L0. Overall, for a single L2 cpuid exit
there are 13 CPU mode switches from guest mode to
root mode and 13 CPU mode switches from root mode
to guest mode, specifically in steps 2, 4, 5b, 5d, 8, and 10.

The number of cycles the CPU spends in a single
switch to guest mode plus the number of cycles to switch
back to root mode is approximately 1,000. With 13 such
round trips, the total CPU switching cost is therefore
around 13,000 cycles.


The other two expensive steps are 3 and 9. As de- 
scribed in Section 3.5, these switches can be optimized. 
Indeed as we show in Figure 10, column 3, using various 
optimizations we can reduce the virtualization overhead 
by 25%, and by 80% when using non-trapping vmread 
and vmwrite instructions. 


By avoiding traps on vmread and vmwrite (Figure 10,
columns 4 and 5), we removed the exits caused
by VMCS1→2 accesses and the corresponding VMCS access
emulation, step 5. This optimization reduced the
switching cost by 84.6%, from 13,000 to 2,000 cycles.
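
The mode-switch arithmetic above can be checked with a short sketch. It is a back-of-the-envelope model only, not part of the Turtles implementation: the function and variable names are hypothetical, and the only inputs are the figures quoted in the text (roughly 1,000 cycles per exit/entry round trip, and 7 VMCS reads plus 4 VMCS writes in step 5).

```python
# Back-of-the-envelope model of the CPU mode-switch cost of one L2 cpuid exit.
# All names are illustrative; the constants are the figures quoted in the text.

CYCLES_PER_ROUND_TRIP = 1_000  # one switch to root mode plus one switch back


def round_trips(vmcs_reads, vmcs_writes, vmcs_access_traps=True):
    """Count guest -> root -> guest round trips while handling one L2 exit."""
    # Steps 2+4 (the L2 exit is forwarded to L1) and 8+10 (L1 resumes L2).
    fixed = 2
    # Each trapping vmread/vmwrite in step 5 adds one more round trip (5b+5d).
    per_access = (vmcs_reads + vmcs_writes) if vmcs_access_traps else 0
    return fixed + per_access


def switching_cost(trips):
    return trips * CYCLES_PER_ROUND_TRIP


baseline_trips = round_trips(7, 4)
optimized_trips = round_trips(7, 4, vmcs_access_traps=False)
baseline_cost = switching_cost(baseline_trips)
optimized_cost = switching_cost(optimized_trips)
reduction_pct = 100 * (baseline_cost - optimized_cost) / baseline_cost

print(baseline_trips, baseline_cost, optimized_cost, round(reduction_pct, 1))
# prints: 13 13000 2000 84.6
```

The model covers only the mode-switch component; the cycles spent inside L0 and L1 (steps 3, 5c, 6, and 9) are not included.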


While it might still be possible to optimize steps 3
and 9 further, it is clear that the exits of L1 while handling
a single exit of L2, and specifically VMCS accesses,
are a major source of overhead. Architectural support for
both faster world switches and VMCS updates without exits
will reduce the overhead.


Examining Figure 10, it seems that handling cpuid
in L1 is more expensive than handling cpuid in L0.
Specifically, in column 3, the nested hypervisor L1
spends around 5,000 cycles to handle cpuid, while in
column 1 the same hypervisor running on bare metal
only spends 1,500 cycles to handle the same exit (note
that these numbers do not include the mode switches).
The code running in L1 and in L0 is identical; the difference
in cycle count is due to cache pollution. Running
the cpuid handling code incurs on average 5 L2 cache
misses and 2 TLB misses when run in L0, whereas running
the exact same code in L1 incurs on average 400 L2
cache misses and 19 TLB misses.


5 Discussion


In nested environments we introduce a new type of work- 
load not found in single-level virtualization: the hypervi- 
sor as a guest. Traditionally, x86 hypervisors were de- 
signed and implemented assuming they will be running 
directly on bare metal. When they are executed on top of 
another hypervisor this assumption no longer holds and 
the guest hypervisor behavior becomes a key factor. 

With a nested L1 hypervisor, the cost of a single L2
exit depends on the number of exits caused by L1 during
the L2 exit handling. A nesting-friendly L1 hypervisor
should minimize this critical chain to achieve better
performance, for example by limiting the use of trap-causing
instructions in the critical path.

Another alternative for reducing this critical chain is to
para-virtualize the guest hypervisor, similar to OS
para-virtualization [6, 50, 51]. While this approach could reduce
L0 intervention when L1 virtualizes the L2 environment,
the work being done by L0 to virtualize the
L1 environment will still persist. How much this technique
can help depends on the workload and on the specific
approach used. Taking as a concrete example the
conversion of vmreads and vmwrites to non-trapping
load/stores, para-virtualization could reduce the overhead
for kernbench from 14.5% to 10.3%.


5.1 Architectural Overhead 


Part of the overhead introduced with nested virtualization 
is due to the architectural design choices of x86 hardware 
virtualization extensions. 

Virtualization API: Two performance sensitive areas 
in x86 virtualization are memory management and I/O 
virtualization. With multi-dimensional paging we com- 
pressed three MMU translation tables onto the two avail- 
able in hardware; multi-level device assignment does 
the same for IOMMU translation tables. Architectural 
support for multiple levels of MMU and DMA transla- 
tion tables—as many tables as there are levels of nested 
hypervisors—will immediately improve MMU and I/O 
virtualization. 

Architectural support for delivering interrupts directly
from the hardware to the L2 guest will remove L0 intervention
on interrupt delivery and completion, intervention
which, as we explained in Section 4.1.1, hurts nested
performance. Such architectural support will also help
single-level I/O virtualization performance [33].

VMX features such as MSR bitmaps, I/O bitmaps, and
CR masks/shadows [48] proved to be effective in reducing
exit overhead. Any architectural feature that reduces
single-level exit overhead also shortens the nested critical
path. Such features, however, also add implementation
complexity, since to exploit them in nested environments
they must be properly emulated by L0 hypervisors.

Removing the (Intel-specific) need to trap on every 
vmread and vmwrite instruction will give an imme- 
diate performance boost, as we showed in Section 3.5.2. 

Same Core Constraint: The x86 trap-and-emulate
implementation dictates that the guest and hypervisor
share each core, since traps are always handled on the
core where they occurred. Due to this constraint, when
the hypervisor handles an exit the guest is temporarily
stopped on that core. In a nested environment, the L1
guest hypervisor will also be interrupted, increasing the
total interruption time of the L2 guest. Gavrilovska et
al. presented techniques for exploiting additional cores
to handle guest exits [19]. According to the authors, for
a single level of virtualization, they measured 41% average
improvements in call latency for null calls, cpuid, and
page table updates. These techniques could be adapted
for nested environments in order to remove L0 interventions
and also reduce privileged instruction call latencies,
decreasing the total interruption time of a nested
guest.

Cache Pollution: Each time the processor switches
between the guest and the host context on a single core,
the effectiveness of its caches is reduced. This phenomenon
is magnified in nested environments, due to
the increased number of switches. As was seen in Section 4.3,
even after discounting L0 intervention, the L1
hypervisor still took more cycles to handle an L2 exit
than it took to handle the same exit for the single-level
scenario, due to cache pollution. Dedicating cores to
guests could reduce cache pollution [7, 45, 46] and increase
performance.


6 Conclusions and Future Work 


Efficient nested x86 virtualization is feasible, despite 
the challenges stemming from the lack of architectural 
support for nested virtualization. Enabling efficient 
nested virtualization on the x86 platform through multi- 
dimensional paging and multi-level device assignment 
opens exciting avenues for exploration in such diverse 
areas as security, clouds, and architectural research. 

We are continuing to investigate architectural and 
software-based methods to improve the performance 
of nested virtualization, while simultaneously exploring 
ways of building computer systems that have nested vir- 
tualization built-in. 

Last, but not least, while the Turtles project is fairly 
mature, we expect that the additional public exposure 
stemming from its open source release will help enhance 
its stability and functionality. We look forward to see- 
ing in what interesting directions the research and open 
source communities will take it.


Abstract


Virtualized servers run a diverse set of virtual machines 
(VMs), ranging from interactive desktops to test and de- 
velopment environments and even batch workloads. Hy- 
pervisors are responsible for multiplexing the underlying 
hardware resources among VMs while providing them 
the desired degree of isolation using resource manage- 
ment controls. Existing methods provide many knobs 
for allocating CPU and memory to VMs, but support for 
control of IO resource allocation has been quite limited. 
IO resource management in a hypervisor introduces sig- 
nificant new challenges and needs more extensive con- 
trols than in commodity operating systems. 

This paper introduces a novel algorithm for IO re- 
source allocation in a hypervisor. Our algorithm, 
mClock, supports proportional-share fairness subject to 
minimum reservations and maximum limits on the IO 
allocations for VMs. We present the design of mClock 
and a prototype implementation inside the VMware ESX 
server hypervisor. Our results indicate that these rich 
QoS controls are quite effective in isolating VM perfor- 
mance and providing better application latency. We also 
show an adaptation of mClock (called dmClock) for a 
distributed storage environment, where storage is jointly 
provided by multiple nodes. 


1 Introduction 


The increasing trend towards server virtualization has el- 
evated hypervisors to first class entities in today’s data- 
centers. Virtualized hosts run tens to hundreds of virtual 
machines (VMs), and the hypervisor needs to provide 
each virtual machine with the illusion of owning ded- 
icated physical resources: CPU, memory, network and 
storage IO. Strong isolation is needed for successful con- 
solidation of VMs with diverse requirements on a shared 
infrastructure. Existing products such as VMware ESX 
server hypervisor provide guarantees for CPU and mem- 
ory allocation using sophisticated controls such as reser- 
vations, limits and shares [3, 44]. However, the cur- 
rent state of the art in storage IO resource allocation 
is much more rudimentary, limited to providing propor- 
tional shares [20] to different VMs. 

Figure 1: Orders/sec for VM 5 decreases as the load on
the shared storage device increases from VMs running
on other hosts.

IO scheduling in a hypervisor introduces many new
challenges compared to managing other shared resources.
First, virtualized servers typically access a
shared storage device using either a clustered file system 
such as VMFS [11] or NFS volumes. A storage device 
in the guest OS or a VM is just a large file on the shared 
storage device. Second, the IO scheduler in the hypervi- 
sor runs one layer below the elevator-based scheduling 
in the guest OS. Hence, it needs to handle issues such as 
locality of accesses across VMs, high variability in IO 
sizes, different request priorities based on the applica- 
tions running in the VMs, and bursty workloads. 

In addition, the amount of IO throughput available to 
any particular host can fluctuate widely based on the be- 
havior of other hosts accessing the shared device. Unlike 
CPU and memory resources, the IO throughput avail- 
able to a host is not under its own control. As shown 
in the example below, this can cause large variations in 
the IOPS available to a VM and impact application-level 
performance. 

Consider the simple scenario shown in Figure 1, with 
three hosts and five VMs. Each VM is running a DVD- 
Store [2] benchmark, which is an IO-intensive OLTP 
workload. The system administrator has carefully pro- 
visioned the resources (CPU and memory) needed by 
VM 5, so that it can serve at least 400 orders per second. 
Initially, VM 5 is running on host 3, and it achieves a
transaction rate of roughly 500 orders/second. Later, as
four other VMs (1–4), running on two separate hosts
sharing the same storage device, start to consume IO
bandwidth, the transaction rate of VM 5 drops to 275
orders per second, which is significantly lower than expected.
Other events that can cause this sort of fluctuation
are: (1) changes in workloads, (2) background tasks
scheduled at the storage array, and (3) changes in SAN
paths between the hosts and storage device.

PARDA [20] provided a distributed control algorithm
to allocate queue slots at the storage device to hosts in
proportion to the aggregate IO shares of the VMs running
on them. The local IO scheduling at each host
was done using SFQ(D) [24], a traditional fair scheduler,
which divides the aggregate host throughput among the
VMs in proportion to their shares. Unfortunately, as aggregate
throughput fluctuates downwards, or as the value
of a VM’s shares is diluted by the addition of other VMs
to the system, the absolute throughput for a VM falls.
This open-ended dilution is unacceptable in many applications
that need a minimum level of resources to
function. Lack of QoS support for IO resources can have
widespread effects, rendering existing CPU and memory
controls ineffective when applications block on IO
requests. Arguably, this limitation is one of the reasons
for the slow adoption of IO-intensive applications in
virtualized environments.

Resource controls such as shares (a.k.a. weights), 
reservations, and limits are used for predictable service 
allocation with strong isolation [8, 34, 43, 44]. Shares
are a relative allocation measure that specifies the ratio in
which the different VMs receive service. Reservations
and limits are expressed in absolute units, e.g. CPU cy- 
cles/sec or megabytes of memory. The general idea is to 
allocate the resource to the VMs in proportion to their 
shares, subject to the constraints that each VM receives 
at least its reservation and no more than its limit. These 
controls have primarily been employed for allocating re- 
sources like CPU time and memory pages where the re- 
source capacity is known and fixed. 

For fixed-capacity resources, one can combine shares 
and reservations into one single allocation for a VM. 
This allocation can be calculated whenever a new VM 
enters or leaves the system, since these are the only 
events at which the allocation is affected. However, en- 
forcing these controls is much more difficult when the 
capacity fluctuates dynamically, as is the case for the IO 
bandwidth of shared storage. In this case the allocations 
need to be continuously monitored (rather than only at 
VM entry and exit) to ensure that no VM falls below 
its minimum. A brute-force solution is to emulate the
method used for fixed-capacity resources by recomputing
the allocations periodically. However, this method
relies on being able to accurately predict future capacity
based on the current state.

Finally, limits provide an upper bound on the absolute
resource allocations. Such a limit on IO performance
is desirable to prevent competing IO-intensive applications,
such as virus scanners, virtual-disk migrations, or
backup operations, from consuming all the spare bandwidth
in the system, which can result in high latencies
for bursty and ON-OFF workloads. There are yet other
reasons cited by service providers for wanting to explicitly
limit IO throughput; for example, to avoid giving
VMs more throughput than has been paid for, or to avoid
raising expectations on performance that cannot generally
be sustained [1, 8].

In this paper, we present mClock, an IO scheduler that
provides all three controls mentioned above at a per-VM
level (Figure 2). We believe that mClock is the first
scheduler to provide such controls in the presence of
capacity fluctuations at short time scales. We have implemented
mClock, along with certain storage-specific
optimizations, as a prototype scheduler in the VMware
ESX server hypervisor and shown its effectiveness for
various use cases.

We also demonstrate dmClock, a distributed version 
of the algorithm that can be used in clustered storage 
systems, where the storage is distributed across multiple 
nodes (e.g., LeftHand [4], Seanodes [6], IceCube [46], 
FAB [30]). dmClock ensures that the overall alloca- 
tion to each VM is based on the specified shares, reser- 
vations, and limits even when the VM load is non- 
uniformly distributed across the storage nodes. 

The remainder of the paper is organized as follows. In 
Section 2 we discuss mClock’s scheduling goal and its 
comparison with existing approaches. Section 3 presents 
the mClock algorithm in detail, along with storage- 
specific optimizations. Distributed implementation for 
a clustered storage system is discussed in Section 3.2. 
Detailed performance evaluation using a diverse set of 
workloads is presented in Section 4. Finally we con- 
clude with some directions for future work in Section 5. 


2 Overview and Related Work 


The work related to QoS-based IO resource allocation
can be divided into three broad areas. First is the class
of algorithms that provide proportional allocation of IO
resources, such as Stonehenge [23], SFQ(D) [24], Argon [41],
and Aqua [48]. Many of these algorithms are
variants of weighted fair queuing mechanisms (Virtual
Clock [50], WFQ [13], PGPS [29], WF²Q [10], SCFQ
[15], Leap Forward [38], SFQ [18] and Latency-rate
scheduling [33]) proposed in the networking literature,
adapted to handle various storage-specific concerns such
as concurrency, minimizing seek delays and improving
throughput.


Algorithm class | Proportional allocation | Latency support | Reservation | Limit | Handle capacity fluctuation
Proportional Sharing (PS) Algorithms | Yes | No | No | No | No
PS + Reservations | Yes | Yes | Yes | |
mClock | Yes | Yes | Yes | Yes | Yes

Table 1: Comparison of mClock with existing scheduling techniques


Figure 3: Allocation of IOPS to various VMs as the
overall throughput changes

The goal of these algorithms is to allocate through- 
put or bandwidth in proportion to the specified weights 
of the clients. Second is the class of algorithms that 
provide support for latency-sensitive applications along 
with proportional sharing. These algorithms include 
SMART [28], BVT [14], pClock [22], Avatar [49] and
service curve based techniques [12, 27, 31, 36]. Third
is the class of algorithms that support reservation along
with proportional allocation, such as Rialto [25], ESX
memory management [44] and other reservation based
CPU scheduling methods [17, 34, 35]. Table 1 provides
a quick comparison of mClock with existing algorithms 
in the three categories. 


2.1 Scheduling Goals of mClock 


We first discuss a simple example describing the
scheduling policy of mClock. As mentioned earlier,
three parameters are specified for each VM in the system:
a share or weight represented by $w_i$, a reservation
$r_i$, and a limit $l_i$. We assume these parameters are externally
provided; determining the appropriate parameter
settings to meet application requirements is an important
but separate problem, outside the scope of this paper. We
also assume that the system includes an admission control
component that ensures that the system capacity is
adequate to serve the aggregate minimum reservations
of all admitted clients. The behavior of the system if the
assumption does not hold is discussed later in the section,
along with alternative approaches.


Consider a simple setup with three VMs: one sup- 
porting remote desktop (RD), one running an Online 
Transaction Processing (OLTP) application and a Data 
Migration (DM) VM. The RD VM has a low through- 
put requirement but needs low IO latency for usability. 
OLTP runs a transaction processing workload requiring 
high throughput and low IO latency. The data migration 
workload requires high throughput but is insensitive to 
IO latency. Based on these requirements, the shares for 
RD, OLTP, and DM can be assigned as 100, 200, and 
300 respectively. To provide low latency and a minimum 
degree of responsiveness, reservations of 250 IOPS each 
are specified for RD and OLTP. An upper limit of 1000 
IOPS is set for the DM workload so that it cannot con- 
sume all the spare bandwidth in the system and cause 
high delays for the other workloads. The values chosen 
here are somewhat arbitrary, but were selected to high- 
light the use of various controls in a diverse workload 
scenario. 


First consider how a conventional proportional scheduler
would divide the total throughput T of the storage
device. Since throughput is allocated to VMs in proportion
to their weights, an active VM $v_i$ will receive
a throughput $T \times (w_i / \sum_j w_j)$, where the summation is
over the weights of the active VMs (i.e., those with at
least one pending IO). If the storage device’s throughput
is 1200 IOPS in the above example, RD will receive
200 IOPS, which is below its required minimum
of 250 IOPS. This can lead to a poor experience for the
RD user, even though there is sufficient system capacity
for both RD and OLTP to receive their reservations
of 250 IOPS. In our model, VMs always receive service
between their minimum reservation and maximum limit
(as long as system throughput is at least the aggregate of
the reservations of active VMs).
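
The pure proportional split in this example is easy to reproduce. The sketch below is illustrative only: the function name is hypothetical, and the reservation of 0 for DM is an assumption, since the text specifies reservations only for RD and OLTP.

```python
# Pure weight-proportional split of a known throughput T (no reservations or limits).
# Names are illustrative; DM's reservation of 0 is an assumption.

def proportional_share(total_iops, weights):
    total_weight = sum(weights.values())
    return {vm: total_iops * w / total_weight for vm, w in weights.items()}


weights = {"RD": 100, "OLTP": 200, "DM": 300}
reservations = {"RD": 250, "OLTP": 250, "DM": 0}

alloc = proportional_share(1200, weights)
below_reservation = [vm for vm in alloc if alloc[vm] < reservations[vm]]

print(alloc)              # {'RD': 200.0, 'OLTP': 400.0, 'DM': 600.0}
print(below_reservation)  # ['RD']
```

RD falls 50 IOPS short of its reservation even though the 1200 IOPS available comfortably exceed the 500 IOPS reserved in total.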


In this case, mClock would provide RD with its minimum
reservation of 250 IOPS and the remaining 950
IOPS would be divided between OLTP and DM in the
ratio 2 : 3, resulting in allocations of 380 and 570 IOPS
respectively. Figure 3 shows the IOPS allocation to the
three VMs in the example above, for different values of
the system throughput, T. For T between 1500 and 2000
IOPS, the throughput is shared between RD, OLTP, and
DM in proportion to their weights (1 : 2 : 3), since none
of them will exceed their limit or fall below the reservation.
If T > 2000 IOPS, then DM will be capped at
1000 IOPS because its share of T/2 is higher than its upper
limit, and the remainder is divided between RD and
OLTP in the ratio 1 : 2. If the total throughput T drops
below 1500 IOPS, the allocation of RD bottoms out at
250 IOPS, and similarly at T < 875 IOPS, OLTP also
bottoms out at 250 IOPS. Finally, for T < 500 IOPS, the
reservations of RD and OLTP cannot be met; the available
throughput will be divided equally between RD and
OLTP (since their reservations are the same) and DM
will receive no service. The last case should be rare if
the admission controller estimates the overall throughput
conservatively.
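
The breakpoints in this example (1500, 2000, 875, and 500 IOPS) can be reproduced by computing the target allocation directly for a known T. The sketch below is not the mClock scheduler, which the paper says reaches these targets through tagging without knowing T. It is an offline calculation that assumes positive weights and searches by bisection for a common proportionality level; the function name and the reservation of 0 for DM are assumptions.

```python
import math

# name: (weight, reservation in IOPS, limit in IOPS); values from the example.
# DM's reservation of 0 and the unlimited RD/OLTP limits are assumptions.
VMS = {
    "RD": (100, 250, math.inf),
    "OLTP": (200, 250, math.inf),
    "DM": (300, 0, 1000),
}


def target_allocation(total_iops, vms=VMS):
    """Offline target: proportional shares clamped to [reservation, limit]."""
    total_reserved = sum(r for _, r, _ in vms.values())
    if total_iops < total_reserved:
        # Reservations cannot all be met: split in proportion to reservations.
        return {n: total_iops * r / total_reserved for n, (_, r, _) in vms.items()}
    if total_iops >= sum(l for _, _, l in vms.values()):
        return {n: l for n, (_, _, l) in vms.items()}  # every VM is at its limit

    def clamp(level, w, r, l):
        return min(max(level * w, r), l)

    def served(level):
        return sum(clamp(level, w, r, l) for w, r, l in vms.values())

    # served() is non-decreasing in level, so bisect for served(level) == T.
    lo, hi = 0.0, 1.0
    while served(hi) < total_iops:
        hi *= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if served(mid) < total_iops:
            lo = mid
        else:
            hi = mid
    return {n: clamp(hi, w, r, l) for n, (w, r, l) in vms.items()}


for t in (400, 800, 1200, 1800, 2400):
    print(t, {n: round(a, 1) for n, a in target_allocation(t).items()})
```

At T = 1200 this yields 250, 380, and 570 IOPS for RD, OLTP, and DM, matching the text; at T = 800 both RD and OLTP sit at their 250 IOPS reservations, and at T = 400 RD and OLTP receive 200 IOPS each while DM receives nothing.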

The allocation to a VM varies dynamically with
the current throughput T and the set of active VMs.
At any time, the VMs are partitioned into three sets:
reservation-clamped ($\mathcal{R}$), limit-clamped ($\mathcal{L}$) or
proportional ($\mathcal{P}$), based on whether their current allocation
is clamped at the lower or upper bound or is in between.
If T is the current throughput, we define
$T_p = T - \sum_{v_j \in \mathcal{R}} r_j - \sum_{v_j \in \mathcal{L}} l_j$.
The allocation $\gamma_i$ made to active
VM $v_i$ for $T_p > 0$ is given by:

\[
\gamma_i =
\begin{cases}
r_i & v_i \in \mathcal{R} \\
l_i & v_i \in \mathcal{L} \\
T_p \times (w_i / \sum_{v_j \in \mathcal{P}} w_j) & v_i \in \mathcal{P}
\end{cases}
\tag{1}
\]

and

\[
\sum_i \gamma_i = T \tag{2}
\]
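
As a worked instance of Eq. (1) and (2), take the three-VM example at T = 1200 IOPS. Here $\gamma$ denotes a VM's allocation, $\mathcal{R}$, $\mathcal{L}$, and $\mathcal{P}$ are the reservation-clamped, limit-clamped, and proportional sets, and $T_p$ is the throughput left after the clamped VMs are served. The set membership below is read off the example in the text rather than derived:

```latex
\begin{align*}
\mathcal{R} &= \{\mathrm{RD}\}, \quad \mathcal{L} = \emptyset, \quad
\mathcal{P} = \{\mathrm{OLTP}, \mathrm{DM}\} \\
T_p &= 1200 - 250 - 0 = 950 \\
\gamma_{\mathrm{RD}} &= r_{\mathrm{RD}} = 250 \\
\gamma_{\mathrm{OLTP}} &= 950 \times \tfrac{200}{200 + 300} = 380 \\
\gamma_{\mathrm{DM}} &= 950 \times \tfrac{300}{200 + 300} = 570 \\
\textstyle\sum_i \gamma_i &= 250 + 380 + 570 = 1200 = T
\end{align*}
```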


When the system throughput T is known, the allocations
$\gamma_i$ can be computed explicitly. Such explicit computation
is sometimes used for calculating CPU time allocations
to virtual machines with service requirement
specifications similar to these. When a VM exits or is
powered on at the host, new service allocations are computed.
In the case of a storage array, T is highly dependent
on the presence of other hosts and the workload
presented to the storage device. Since the throughput
varies dynamically, the storage scheduler cannot rely
upon service allocations computed at VM entry and exit
times. The mClock scheduler ensures that the goals in
Eq. (1) and (2) are satisfied continuously, even as the
system’s throughput varies, using a novel, lightweight
tagging scheme.

Clearly, a feasible allocation is possible only if the aggregate
reservation $\sum_j r_j$ does not exceed the total system
throughput T. When $T_p < 0$, the system throughput
is insufficient to meet the reservations; in this case
mClock simply gives each VM throughput proportional
to its reservation. This may not always be the desired behavior.
VMs without a reservation may be starved in this
case, but this problem can be easily avoided by adding
a small default reservation for all VMs. In addition, one
can add priority control to meet reservations based on
priority levels. Exploring these options further is left to
future work.


2.2 Proportional Share Algorithms 


A number of approaches such as Stonehenge [23], 
SFQ(D) [24] and Argon [41] have been proposed for 
proportional sharing of storage between applications. 
Wang and Merchant [45] extended proportional sharing 
to distributed storage. Argon [41] and Aqua [48] pro- 
pose service-time-based disk allocation to provide fair- 
ness as well as high efficiency. Brandt et al. [47] have 
proposed Hierarchical Disk Sharing, which uses hier- 
archical token buckets to provide isolation and band- 
width reservation among clients accessing the same disk. 
However, measuring per-request service times in our en- 
vironment is difficult because multiple requests will typ- 
ically be pending at the storage device. 

Overall, none of these algorithms offers support for 
the combination of shares, reservations, and limits. 
Other methods for resource management in virtual clusters
[16, 39] have been proposed, but they mainly focus
on CPU and memory resources and do not address the 
challenges raised by variable capacity that mClock does. 


2.3 Latency-sensitive Application Support


Several existing algorithms provide support for con- 
trolling the response time of latency-sensitive applica- 
tions, but not strict latency guarantees or explicit la- 
tency targets. In the case of CPU scheduling, BVT [14], 
SMART [28], and lottery scheduling [37, 43] provide 
proportional allocation, latency-reducing mechanisms, 
and methods to handle priority inversion by exchanging 
tickets. Borrowed Virtual Time [14] and SMART [28] 
can give a short-term advantage to latency-sensitive ap- 
plications by shifting their virtual tags relative to the 
other applications. pClock [22] and service-curve based
methods [12, 27, 31, 36] decouple latency and throughput
requirements but, like the other methods, do not
support reservations and limits.


2.4 Reservation-Based Algorithms 


For CPU scheduling and memory management, several
approaches have been proposed for integrating reservations
with proportional-share allocations [17, 34, 35]. In
these models, clients either receive a guaranteed fraction
of the server capacity (reservation-based clients) or
a share (ratio) of the remaining capacity after satisfying
reservations (proportional-share-based clients). A standard
proportional-share scheduler can be used in conjunction
with an allocator that adjusts the weights of the
active clients whenever there is a client arrival or departure.
Guaranteeing minimum allocations for CPU time
is relatively straightforward since its capacity (in terms
of cycles/sec) is fixed and known, and allocating a given
proportion would guarantee a certain minimum amount.
The same idea does not apply to storage allocation where
system throughput can fluctuate.

In our model the clients are not statically partitioned
into reservation-based or proportional-share-based
clients. Our model automatically modifies the entitlement
of a client when service capacity changes due
to changes in the workload characteristics or due to the
arrival or departure of clients. The entitlement is at least
equal to the reservation and can be higher if there is sufficient
capacity. Since 2003, the VMware ESX Server
has provided reservations and proportional-share controls
for both CPU and memory resources in a commercial
product [8, 42, 44]. These mechanisms support the
same rich set of controls as in mClock, but do not handle
varying service capacity.

Finally, operating system based frameworks like Ri- 
alto [25] provide fixed reservations for known-capacity 
CPU service, while allowing additional service requests 
to be honored on an availability basis. Rialto requires re- 
computation of an allocation graph on each new arrival, 
which is then used for CPU scheduling. 


3 mClock Algorithm


Tag-based scheduling underlies many previously pro- 
posed fair-schedulers [10, 13, 15, 18]: all requests are as- 
signed tags and scheduled in order of their tag values. 
For example, an algorithm can assign tags spaced by increments
of $1/w_i$ to successive requests of client $i$; if all
requests are scheduled in order of their tag values, the
clients will receive service in proportion to $w_i$. In order
to synchronize idle clients with the currently active ones, 
these algorithms also maintain a global tag value com- 
monly known as global virtual time or just virtual time. 
In mClock, we extend this notion to use multiple tags 
based on three controls and dynamically decide which 
tag to use for scheduling, while still synchronizing idle 
clients. 

The intuitive idea behind the mClock algorithm is to 
logically interleave a constraint-based scheduler and a 
weight-based scheduler in a fine-grained manner. The 
constraint-based scheduler ensures that VMs receive at 
least their minimum reserved service and no more than 
the upper limit in a time interval, while the weight-based 
scheduler allocates the remaining throughput to achieve 
proportional sharing. The scheduler alternates between 
phases during which one of these schedulers is active to
maintain the desired allocation.

Symbol    Meaning
$w_i$     Weight of VM $v_i$
$r_i$     Reservation of VM $v_i$
$l_i$     Maximum service allowance (Limit) for $v_i$

Table 2: Symbols used and their descriptions

mClock uses two main ideas: multiple real-time 
clocks and dynamic clock selection. Each VM IO re- 
quest is assigned three tags, one for each clock: a reser- 
vation tag R, a limit tag L, and a proportional share tag P 
for weight-based allocation. Different clocks are used to 
keep track of each of the three controls, and tags based 
on one of the clocks are dynamically chosen to do the 
constraint-based or weight-based scheduling. 

The scheduler has three main components: (i) Tag Assignment,
(ii) Tag Adjustment, and (iii) Request Scheduling. We will
explain each of these in more detail below.
Tag Assignment: This routine assigns R, L and P tags
to a request r from VM $v_i$ arriving at time t. All the tags
are assigned using the same underlying principle, which
we illustrate here using the reservation tag. The R tag
assigned to this request is the higher of the arrival time
or the previous R tag + $1/r_i$. That is:


$$R_i^r = \max\{R_i^{r-1} + 1/r_i,\ \text{Current time}\} \qquad (3)$$


This gives us two key properties: first, the R tags of
a continuously backlogged VM are spaced $1/r_i$ apart.
In an interval of length T, a backlogged VM will have
about $T \times r_i$ requests with R tag values in that interval.
Second, if the current time is larger than this value due
to $v_i$ becoming active after a period of inactivity, the
request is assigned an R tag equal to the current time. Thus
idle VMs do not gain any idle credit for future service.
Similarly, the L tag is set to the maximum of the current
time and $(L_i^{r-1} + 1/l_i)$. The L tags of a backlogged
VM are spaced out by $1/l_i$. Hence, if the L tag of the first
pending request of a VM is less than the current time, it
has received less than its upper limit at this time. A limit
tag higher than the current time would indicate that the
VM has received its limit and should not be scheduled.
The proportional share tag $P_i^r$ is also the larger of the
arrival time of the request and $(P_i^{r-1} + 1/w_i)$, and
subsequent backlogged requests are spaced by $1/w_i$.
Tag Adjustment: Tag adjustment is used to calibrate 
the proportional share tags against real time. This is re- 
quired whenever an idle VM becomes active again. In 
virtual time based schedulers [10, 15] this synchroniza- 
tion is done using global virtual time. The initial P tag 
value of a freshly active VM is set to the current time,
but the spacing of P tags after that is determined by the
relative weights of the VMs. After the VM has been active
for some time, the P tag values become unrelated to
real time. This can lead to starvation when a new VM
becomes active, since the existing P tags are unrelated
to the P tag of the new VM. Hence existing P tags are
adjusted so that the smallest P tag matches the time of
arrival of the new VM, while maintaining their relative
spacing. In the implementation, when a VM is activated,
we assign it an offset equal to the difference between
the effective value of the smallest existing P tag
and the current time. During scheduling, the offset is
added to the P tag to obtain the effective P tag value.
The relative ordering of existing P tags is not altered by
this transformation; however, it ensures that the newly
activated VMs compete fairly with existing VMs.

Algorithm 1: Components of mClock algorithm

    Max_QueueDepth = 32;

    RequestArrival (request r, time t, vm v_i)
    begin
        if v_i was idle then
            /* Tag Adjustment */
            minPtag = minimum of all P tags;
            foreach active VM v_j do
                P_j^r -= minPtag - t;
        /* Tag Assignment */
        R_i^r = max{R_i^{r-1} + 1/r_i, t}   /* Reservation tag */
        L_i^r = max{L_i^{r-1} + 1/l_i, t}   /* Limit tag */
        P_i^r = max{P_i^{r-1} + 1/w_i, t}   /* Shares tag */
        ScheduleRequest();
    end

    ScheduleRequest ()
    begin
        if Active_IOs >= Max_QueueDepth then
            return;
        Let E be the set of requests with R tag <= t
        if E not empty then
            /* constraint-based scheduling */
            select IO request with minimum R tag from E
        else
            /* weight-based scheduling */
            Let E' be the set of requests with L tag <= t
            if E' not empty OR Active_IOs == 0 then
                select IO request with minimum P tag from E'
                /* Assuming request belongs to VM v_k */
                Subtract 1/r_k from R tags of VM v_k
        if IO request selected != NULL then
            Active_IOs++;
    end

    RequestCompletion (request r, vm v_i)
        Active_IOs--;
        ScheduleRequest();

Request Scheduling: mClock needs to check three dif- 
ferent tags to make its scheduling decision instead of 
a single tag in previous algorithms. As noted earlier, 
the scheduler alternates between constraint-based and 
weight-based phases. First, the scheduler checks if there 
are any eligible VMs with R tags no more than the cur- 
rent time. If so, the request with smallest R tag is dis- 
patched for service. This is defined as the constraint- 
based phase. This phase ends (and the weight-based 
phase begins) at a scheduling instant when all the R tags 
exceed the current time. 

During a weight-based phase, all VMs have received
their reservations guaranteed up to the current time. The
scheduler therefore allocates server capacity to achieve
proportional service. It chooses the request with the smallest
P tag, but only from VMs which have not reached
their limit (whose L tag is smaller than the current
time). Whenever a request from VM $v_i$ is scheduled in
a weight-based phase, the R tags of the outstanding
requests of $v_i$ are decreased by $1/r_i$. This maintains the
condition that R tags are always spaced apart by $1/r_i$, so
that reserved service is not affected by the service
provided in the weight-based phase. Algorithm 1 provides
pseudo code of various components of mClock.
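
To make the interplay of the three tags concrete, the following is a minimal Python sketch of the tag assignment, tag adjustment, and two-phase request selection described above. It is an illustration under simplifying assumptions, not the ESX implementation: the names VM, MClock, next_tag, and simulate are hypothetical, a VM is treated as idle when its queue is empty, the Max_QueueDepth check and the "Active_IOs == 0" clause are omitted, and the device is modeled as completing exactly one request every 1/iops seconds.

```python
import math
from collections import deque
from dataclasses import dataclass, field


@dataclass
class VM:
    r: float  # reservation in IOPS (0 means none)
    l: float  # limit in IOPS (math.inf means none)
    w: float  # weight, must be positive
    last: list = field(default_factory=lambda: [-math.inf] * 3)  # last R, L, P
    queue: deque = field(default_factory=deque)  # pending [R, L, P] tags, FCFS


def next_tag(prev, rate, t):
    """max{prev + 1/rate, t}; a zero rate yields a tag that is never due."""
    return math.inf if rate == 0 else max(prev + 1.0 / rate, t)


class MClock:
    def __init__(self, vms):
        self.vms = vms  # dict: name -> VM

    def arrive(self, name, t):
        vm = self.vms[name]
        if not vm.queue:  # VM was idle: tag adjustment on the active VMs
            active = [v for v in self.vms.values() if v.queue]
            if active:
                shift = min(tag[2] for v in active for tag in v.queue) - t
                for v in active:
                    v.last[2] -= shift
                    for tag in v.queue:
                        tag[2] -= shift
        vm.last = [next_tag(vm.last[0], vm.r, t),
                   next_tag(vm.last[1], vm.l, t),
                   next_tag(vm.last[2], vm.w, t)]
        vm.queue.append(list(vm.last))

    def schedule(self, t):
        """Return the VM whose head request is dispatched at time t, or None."""
        heads = [(n, v) for n, v in self.vms.items() if v.queue]
        due = [(n, v) for n, v in heads if v.queue[0][0] <= t]
        if due:  # constraint-based phase: smallest R tag
            name, vm = min(due, key=lambda nv: nv[1].queue[0][0])
            vm.queue.popleft()
            return name
        under = [(n, v) for n, v in heads if v.queue[0][1] <= t]
        if not under:  # every backlogged VM has reached its limit
            return None
        name, vm = min(under, key=lambda nv: nv[1].queue[0][2])  # smallest P tag
        vm.queue.popleft()
        if vm.r > 0:  # weight-based service must not consume reservation
            vm.last[0] -= 1.0 / vm.r
            for tag in vm.queue:
                tag[0] -= 1.0 / vm.r
        return name


def simulate(vms, iops, seconds, depth=4):
    """Closed loop: each VM keeps `depth` requests queued; one dispatch per 1/iops."""
    sched = MClock(vms)
    served = dict.fromkeys(vms, 0)
    for name in vms:
        for _ in range(depth):
            sched.arrive(name, 0.0)
    for k in range(1, int(iops * seconds) + 1):
        t = k / iops
        name = sched.schedule(t)
        if name is not None:
            served[name] += 1
            sched.arrive(name, t)
    return served
```

For instance, simulate({"a": VM(r=30, l=math.inf, w=1), "b": VM(r=0, l=math.inf, w=3)}, iops=100, seconds=10) gives VM a roughly 30 IOPS, its reservation, rather than the 25 IOPS that the 1:3 weights alone would yield on this toy device.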


3.1 Storage-specific Issues 


There are several storage-specific issues that an IO 
scheduler needs to handle: IO bursts, request types, IO 
size, locality of requests and reservation settings. 

Burst Handling. Storage workloads are known to be 
bursty, and requests from the same VM often have a high 
spatial locality. We help bursty workloads that were idle 
to gain a limited preference in scheduling when the sys- 
tem next has spare capacity. This is similar to some of 
the ideas proposed in BVT [14] and SMART [28]. How- 
ever, we do it in a manner so that reservations are not 
impacted. 

To accomplish this, we allow VMs to gain idle credits.
In particular, when an idle VM becomes active, we
compare the previous P tag with current time t and
allow it to lag behind t by a bounded amount based on
a VM-specific burst parameter. Instead of setting the P
tag to the current time, we set it equal to $t - \sigma_i \cdot (1/w_i)$.
Hence the actual assignment looks like:


$$P_i^r = \max\{P_i^{r-1} + 1/w_i,\ t - \sigma_i/w_i\}$$


The parameter $\sigma_i$ can be specified per VM and
determines the maximum amount of credit that can be gained
by becoming idle. Note that adjusting only the P tag
has the nice property that it does not affect the
reservations of other VMs; however if there is spare capacity in
the system, it will be preferentially given to the VM that
was idle. This is because the R and L tags have strict
priority over the P tags, so adjusting P tags cannot affect
the constraint-based phase of the scheduler.
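
As a small numeric illustration of the idle-credit rule, the sketch below evaluates the P tag assignment for an idle and for a backlogged VM. The function name and the sample values are hypothetical and chosen only to show the two branches of the max.

```python
def p_tag_with_idle_credit(prev_p, w, sigma, t):
    """P tag for a new request: max{prev_p + 1/w, t - sigma/w}."""
    return max(prev_p + 1.0 / w, t - sigma / w)


# A VM that was idle for a long time may lag real time by at most sigma/w.
print(p_tag_with_idle_credit(prev_p=10.0, w=2.0, sigma=64, t=100.0))   # 68.0
# A backlogged VM keeps its usual 1/w spacing.
print(p_tag_with_idle_credit(prev_p=99.75, w=2.0, sigma=64, t=100.0))  # 100.25
```

In the first call the tag lands 32 time units behind the current time, so the VM is preferred in the weight-based phase until its tags catch up; its R and L tags are untouched.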

Request Type. mClock treats reads and writes iden- 
tically. In practice writes show lower latency due to 
write buffering in the disk array. However doing any 
re-ordering of reads before writes for a single VM can 
lead to an inconsistent state of the virtual disk on a crash. 
Hence mClock schedules all IOs within a VM in a FCFS 
order without distinguishing between reads and writes. 
IO size. Since larger IO sizes take longer to complete, 
differently-sized IOs should not be treated equally by the 
IO scheduler. We propose a technique to handle large-sized
IOs during tagging. The IO latency with n random
outstanding IOs with an IO size of S each can be written
as:


$$Lat = n\,(T_m + S/B_{peak}) \qquad (4)$$


Here $T_m$ denotes the mechanical delay due to seek and
disk rotation and $B_{peak}$ denotes the peak transfer
bandwidth of a disk. Converting the latency observed for an
IO of size $S_1$ to an IO of a reference size $S_2$, keeping
other factors constant would give:


$$Lat_2 = Lat_1 \times \left(1 + \frac{S_2}{T_m \times B_{peak}}\right) \Big/ \left(1 + \frac{S_1}{T_m \times B_{peak}}\right) \qquad (5)$$


For a small reference IO size of 8KB and using typical
values for mechanical delay $T_m$ = 5 ms and peak transfer
rate $B_{peak}$ = 60 MB/s, the numerator = $Lat_1 \times (1 + 8/300) \approx Lat_1$.
So, for tagging purposes, a single request of IO size S
is treated as equivalent to
$(1 + S/(T_m \times B_{peak}))$ IO requests.
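
The size normalization can be sketched as follows, using the typical values quoted above as defaults. The function names are hypothetical, and 1 MB is taken as 1000 KB so that $T_m \times B_{peak}$ equals 300 KB, matching the 8/300 term in the text.

```python
def io_cost(size_kb, t_m_ms=5.0, b_peak_mb_s=60.0):
    """Number of reference requests that one request of size_kb counts as."""
    kb_per_delay = t_m_ms * b_peak_mb_s  # T_m x B_peak in KB (ms x MB/s)
    return 1.0 + size_kb / kb_per_delay


def convert_latency(lat_1, size_1_kb, size_2_kb):
    """Equation (5): latency at size_2 given the latency seen at size_1."""
    return lat_1 * io_cost(size_2_kb) / io_cost(size_1_kb)


print(io_cost(8.0))    # about 1.027: an 8KB IO is close to one unit
print(io_cost(300.0))  # 2.0: a 300KB IO is tagged like two requests
```

With these defaults the transfer time only matches the mechanical delay at 300 KB, which is why small IOs are charged almost uniformly.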

Request Location. mClock can detect sequentiality 
within a VM’s workload, but in most virtualized envi- 
ronments the IO stream seen by the underlying storage 
may not be sequential due to a high degree of multiplex- 
ing. mClock improves the overall efficiency of the sys- 
tem by scheduling IOs with high locality as a batch. A 
VM is allowed to issue IO requests in a batch as long 
as the requests are close in logical block number space 
(e.g., within 4 MB). Also the size of batch is bounded by 
a configurable parameter (set to 8). 

This optimization impacts the time granularity over
which reservations are met. The batching of IOs is limited
to a small number, typically 8, so for N VMs, the
delay in meeting reservations can be 8N IOs. A typical
number of VMs/host is 10-15, so this can delay reserva- 
tion guarantees in the short term by the time taken to do 
roughly 100 IOs. Note that the benefit of batching and 
improved efficiency is distributed among all the VMs in- 
stead of giving it just to the VM with high sequentiality. 


It may be preferable to allocate the benefit of locality to 
the concerned VM; this is deferred to future work. 
Reservation Setting. Admission control is a well 
known and difficult problem for storage devices due to 
their stateful nature and dependence of the throughput 
on the workload. We propose the simple approach of us- 
ing the worst case IOPS from a storage device as an up- 
per bound on sum of reservations for admission control. 
For example, an enterprise FC disk can service 200 to 
250 random IOPS and a SATA disk can do roughly 80- 
100 IOPS. Based on the number and type of disk drives 
backing a storage LUN, one can obtain a conservative 
estimate of reservable throughput. This is what we have 
used to set parameters in our experiments. Also in order 
to set the reservations to meet an application’s latency 
for a certain number of outstanding IOs, we use Little’s 
law: 

IOPS = Outstanding IOs/Latency (6) 


Thus, for an application that typically keeps 8 IOs out- 
standing and requires 25 ms average latency, the reser- 
vation should be set to 8 / 0.025 = 320 IOPS. 
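
The two rules above, Little's law for sizing a reservation and the worst-case IOPS bound for admission control, can be sketched as below. The function names are hypothetical, and the estimate of a LUN's reservable throughput as (number of disks) x (per-disk worst-case IOPS) is an assumption made for illustration.

```python
def reservation_for(outstanding_ios, latency_s):
    """Little's law: IOPS needed to hold the given average latency."""
    return outstanding_ios / latency_s


def admit(reservations, worst_case_iops):
    """Accept a set of reservations only if their sum fits the worst case."""
    return sum(reservations) <= worst_case_iops


r = reservation_for(outstanding_ios=8, latency_s=0.025)
print(round(r))  # 320

# Hypothetical LUN backed by 4 FC disks at 200 worst-case random IOPS each.
reservable = 4 * 200
print(admit([r, 250], reservable))       # True
print(admit([r, 250, 300], reservable))  # False
```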


3.2 Distributed mClock 


Cluster-based storage systems are emerging as a cost- 
effective, scalable alternative to expensive, centralized 
disk arrays. By using commodity hardware (both hosts 
and disks) and using software to glue together the stor- 
age distributed across the cluster, these systems allow 
for lower cost and more flexible provisioning than con- 
ventional disk arrays. The software can be designed to 
compensate for the reliability and consistency issues in- 
troduced by the distributed components. 

Several research prototypes (e.g., CMU’s Ursa Mi- 
nor [9], HP Labs’ FAB [30], IBM’s Intelligent 
Bricks [46]) have been built, and several companies 
(such as LeftHand [4], Seanodes [6]) are offering iSCSI- 
based storage devices using local disks at virtualized 
hosts. In this section, we extend mClock to run on each 
storage server, with minimal communication between 
the servers, and yet provide per-VM globally (cluster- 
wide) proportional service, reservations, and limits. 


3.2.1 dmClock Algorithm 


dmClock runs a modified version of mClock at each 
server. There is only one modification to the algorithm to 
account for the distributed model in the Tag-Assignment 
component. During tag assignment each server needs to 
determine two things: the aggregate service received by 
the VM from all the servers in the system and the amount 
of service that was done as part of reservation. This
information will be provided implicitly by the host running
a VM by piggybacking two integers $\rho_i$ and $\delta_i$ with
each request that it forwards to a storage server $s_j$. Here
$\delta_i$ denotes the number of IO requests from VM $v_i$ that have
completed service at all the servers between the previous
request (from $v_i$) to the server $s_j$ and the current request.
Similarly, $\rho_i$ denotes the number of IO requests from $v_i$
that have been served as part of constraint-based phase
between the previous request to $s_j$ and the current
request. This information can be easily maintained by the
host running the VM. The host forwards the values of
$\rho_i$ and $\delta_i$ along with $v_i$'s request to a server. (Note that
for the single server case, $\rho$ and $\delta$ will always be 1.)
In the Tag-Assignment routine, these values are used to
compute the tags as follows:


$$R_i^r = \max\{R_i^{r-1} + \rho_i/r_i,\ t\}$$
$$L_i^r = \max\{L_i^{r-1} + \delta_i/l_i,\ t\}$$
$$P_i^r = \max\{P_i^{r-1} + \delta_i/w_i,\ t\}$$





Hence, the new request may receive a tag further into
the future, to reflect the fact that $v_i$ has received
additional service at other servers. The greater the value of
$\delta$, the lower the priority the request has for service. Note
that this does not require any synchronization among the
storage servers. The remainder of the algorithm remains
unchanged. The values of $\rho$ and $\delta$ may, in the worst
case, be inaccurate by up to 1 request at each of the other
servers. However, the dmClock algorithm does not
require complex synchronization between the servers [32].
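
The modified tag assignment can be sketched as a single function that a storage server would apply to each incoming request. The function name and argument layout are hypothetical; rho and delta stand for the two piggybacked counters, and r, l, w are assumed to be positive and finite here.

```python
def dmclock_tags(prev, rho, delta, r, l, w, t):
    """New (R, L, P) tags at one server, given the previous tags there."""
    prev_r, prev_l, prev_p = prev
    return (max(prev_r + rho / r, t),    # reservation tag advances by rho/r
            max(prev_l + delta / l, t),  # limit tag advances by delta/l
            max(prev_p + delta / w, t))  # shares tag advances by delta/w


# Single-server case: rho = delta = 1 reduces to the mClock spacing.
print(dmclock_tags((10.0, 10.0, 10.0), 1, 1, 4.0, 4.0, 4.0, t=0.0))
# -> (10.25, 10.25, 10.25)

# Service received elsewhere pushes the tags further into the future.
print(dmclock_tags((10.0, 10.0, 10.0), 2, 3, 4.0, 4.0, 4.0, t=0.0))
# -> (10.5, 10.75, 10.75)
```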


4 Performance Evaluation 


In this section, we present results from a detailed evalu- 
ation of mClock using a prototype implementation in the 
VMware ESX server hypervisor [7,40]. The changes 
required were small: the overall implementation took 
roughly 200 lines of C code in order to modify an ex- 
isting scheduling framework. The resulting scheduler is 
lightweight, which is important because it is on the crit- 
ical path for IO issues and completions. We examine the 
following key questions about mClock: 

(1) Why is mClock needed? (2) Can mClock allo- 
cate service in proportion to weights, while meeting the 
reservation and limit constraints? (3) Can mClock han- 
dle bursts effectively and reduce latency by giving idle 
credit? (4) How effective is dmClock in providing isola- 
tion among dynamic workloads in a distributed storage 
environment? 


4.1 Experimental Setup 


We implemented mClock by modifying the SCSI 
scheduling layer in the IO stack of VMware ESX server 
hypervisor to construct our prototype. The ESX host 
was a Dell Poweredge 2950 server with 2 Intel Xeon 
3.0 GHz dual-core processors, 8GB of RAM and two 
Qlogic HBAs connected to an EMC CLARiiON CX3-40
storage array over FC SAN. We used two different
storage volumes: one hosted on a 10-disk RAID 0 disk
group and another on a 10-disk RAID 5 disk group. The
host was configured to keep 32 IOs pending per LUN at
the array, which is the default setting.

We used a diverse set of workloads, using different 
operating systems, workload generators, and configura- 
tions, to verify that mClock is robust under a variety 
of conditions. We used two kinds of VMs: (1) Linux 
(RHEL) VMs, each with a 10GB virtual disk, one VCPU 
and 512 MB memory, and (2) Windows server 2003 
VMs, each with a 16GB virtual disk, one VCPU and 1 
GB of memory. The disks hosting the operating systems 
for VMs were on a different storage LUN. 

Three parameters were configured for each VM: a
minimum reservation $r_i$ IOPS, a global weight $w_i$, and
maximum limit $l_i$ IOPS. The workloads were generated
using Iometer [5] in the Windows server VMs
and our own micro-workload generator in the Linux
RHEL VMs. For both cases, the workloads were specified
using IO sizes, the percentage of reads, the percentage
of random IOs, and the number of concurrent IOs.
We used 32 concurrent IOs per workload in
all experiments, unless otherwise stated. In addition
to these micro-benchmark workloads, we used
macro-benchmark workloads generated using Filebench [26].

Figure 5: mClock limits the throughput of VM2 and
VM3 to 400 and 500 IOPS as desired.


4.1.1 Limit Enforcement 


First we show the need for the limit control by demon- 
strating that pure proportional sharing cannot guarantee 
the specified number of IOPS and latency to a VM. We 
experimented with three workloads similar to those in 
the example of Section 2: RD, OLTP and DM. 

RD is a bursty workload sending 32 random IOs (75%
reads) of 4KB size every 250 ms. OLTP sends 8KB random
IOs, 75% reads, and keeps 16 IOs pending at all
times. The data migration workload DM does 32KB
sequential reads, and keeps 32 IOs pending at all times.
RD and OLTP are latency-sensitive workloads, requiring
a response time under 30ms, while DM is not sensitive
to latency. Accordingly, we set the weights in the ratio
2:2:1 for the RD, OLTP, and DM workloads. First, we
ran them with zero reservations and no limits in mClock,
which is equivalent to running them with a standard fair
scheduler such as SFQ(D) [24]. The throughput and
latency achieved are shown in Figures 4(a) and (b),
between times 60 and 140sec. Since RD was not fully
backlogged, and OLTP had only 16 concurrent IOs, the
work-conserving scheduler gave all the remaining queue
slots (16 of them) to the DM workload. As a result, RD
and OLTP got less than the specified proportion of IO
throughput, while DM received more. Since the device
queue was always heavily occupied by IO requests from
DM, the latency seen by RD and OLTP was higher than
desirable. We also experimented with other weight ratios
(which are not shown here for lack of space), but saw
no significant improvement, because the primary cause
of the poor performance seen by RD and OLTP was that
there were too many IOs from DM in the device queue.

Figure 4: Average throughput and latency for RD, OLTP and DM workloads, with weights = 2:2:1. At t=140 the
limit for DM is set to 300 IOPS. mClock is able to restrict the DM workload to 300 IOPS and improve the latency of
RD and OLTP workloads.

Figure 6: Five VMs with weights in ratio 1:1:2:2:2. VMs are started at 60 sec intervals. The overall throughput
decreases as more VMs are added. mClock enforces reservations and SFQ only does proportional allocation.


To provide better throughput and lower latency to RD
and OLTP workloads, we changed the upper limit for
DM to 300 IOPS (from unlimited) at t = 140 sec. This
caused the OLTP workload to see a 100% increase in
throughput and the latency was reduced by half (36 ms
to 16 ms). The RD workload also saw lower latency,
while its throughput remained equal to its demand. This
result shows that using limits with proportional sharing
can be quite effective in reducing contention for critical
workloads, and this effect cannot be produced using
proportional sharing alone.


Next, we did an experiment to show that mClock ef- 
fectively enforces limits in a more dynamic setting with 
workloads arriving at different times. Using Iometer on 
Windows Server VMs, we ran three workloads (VM1, 
VM2, and VM3), each generating 16KB random reads. 
We set the weights in the ratio 1:1:2, with limits of 400 
IOPS on VM2 and 500 IOPS on VM3. We began with 
just VM1 and a new workload was started every 60 sec- 
onds. The storage device had a capacity of about 1600 
random reads per second. Without the limits and based 
on the weights alone, we would expect the applications 
to receive 800 IOPS each when VM1 and VM2 are running,
and 400, 400, and 800 IOPS respectively when
VM1, VM2, and VM3 are running together.

Figure 7: Average throughput for VMs using SFQ(D) and mClock. mClock is able to restrict the allocation of VM2
to 700 IOPS and always provide at least 250 IOPS to VM4.

Figure 5 shows the throughput obtained by each of the 
workloads. When we added VM2 (at time 60 sec), it
received only 400 IOPS based on its limit, and not the 
800 IOPS it would have received based on the weights 
alone. When we started VM3 (at time 120sec), it re- 
ceived only its maximum limit, 500 IOPS, again smaller 
than its throughput share based on the weights alone. 
This shows that mClock is able to limit the throughput 
of VMs based on specified upper limits. 
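
The weight-only expectations quoted for this experiment, and the effect of the limits on them, can be reproduced with a short calculation. This is an idealized sketch, assuming a fixed capacity of 1600 IOPS and fully backlogged VMs; the function name is hypothetical and the outputs are arithmetic expectations, not measurements from the experiment.

```python
import math


def share_with_limits(capacity, weights, limits):
    """Split capacity by weight, capping VMs at their limit and
    redistributing what they cannot use to the remaining VMs."""
    alloc = {}
    active = dict(weights)
    remaining = capacity
    while active:
        total_w = sum(active.values())
        capped = [n for n, w in active.items()
                  if remaining * w / total_w >= limits[n]]
        if not capped:
            for n, w in active.items():
                alloc[n] = remaining * w / total_w
            break
        for n in capped:
            alloc[n] = limits[n]
            remaining -= limits[n]
            del active[n]
    return alloc


weights = {"VM1": 1, "VM2": 1, "VM3": 2}
no_limits = {"VM1": math.inf, "VM2": math.inf, "VM3": math.inf}
limits = {"VM1": math.inf, "VM2": 400, "VM3": 500}

print(share_with_limits(1600, weights, no_limits))
# -> {'VM1': 400.0, 'VM2': 400.0, 'VM3': 800.0}
print(share_with_limits(1600, weights, limits))
# -> {'VM2': 400, 'VM3': 500, 'VM1': 700.0}
```

Under these assumptions the capacity that VM2 and VM3 cannot use because of their limits flows to VM1, the only VM without a limit.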


4.1.2 Reservations Enforcement 


To test the ability of mClock to enforce reservations, we
used a combination of 5 workloads, VM1 through VM5, all
generated using Iometer on Windows Server VMs. Each
workload maintained 32 outstanding IOs, all 16 KB random
reads, at all times. We set their shares to the ratio
1:1:2:2:2. VM1 required a minimum of 300 IOPS, VM2
required 250 IOPS, and the rest had no minimum
requirement. To demonstrate again the working of mClock
in a dynamic environment, we began with just VM1, and
a new workload was started every 60 seconds.

Figure 6(a) shows the overall throughput observed
by the host using SFQ(D=32) and mClock. As the
number of workloads increased, the overall throughput 
from the array decreased because the combined work- 
load spanned larger numbers of tracks on the disks. 
Figures 6(b) and (c) show the throughput obtained by 
each workload using SFQ(D=32) and mClock respec- 
tively. When we used SFQ(D), the throughput of each 
VM decreased with increasing load, down to 160 IOPS 
for VM1 and VM2, while the remaining VMs received 
around 320 IOPS. In contrast, mClock provided 300 
IOPS to VM1 and 250 IOPS to VM2, as desired. In- 
creasing the throughput allocation also led to a smaller 
latency (as expected) for VM1 and VM2, which would 
not have been possible just using proportional shares. 




VM     Size, read%, random%    $r_i$    $l_i$
VM1    4K, 75%, 100%           0        MAX
VM2    8K, 90%, 80%            0        700
VM3    16K, 75%, 20%           0        MAX
VM4                            250      MAX

Table 3: VM workloads characteristics and parameters


4.1.3 Diverse VM Workloads


In the experiments above, we used mostly homoge- 
neous workloads for ease of exposition and understand- 
ing. To demonstrate the effectiveness of mClock with 
a non-homogeneous combination of workloads, we ex- 
perimented with workloads having very different IO 
characteristics. We used four workloads, generated us- 
ing Iometer on Windows VMs, each keeping 32 IOs 
pending at all times. The workload configurations and 
the resource control settings (reservations, limits, and 
weights) are shown in Table 3. 

Figures 7(a) and (b) show the throughputs allocated 
by SFQ(D) (weight-based allocation) and by mClock for 
these workloads. mClock was able to restrict VM2 to 
700 IOPS, as desired, when only two VMs were doing 
IOs. Later, when VM4 became active, mClock was able 
to meet the reservation of 250 IOPS for it, whereas SFQ 
only provided around 190 IOPS. While meeting these 
constraints, mClock was able to keep the allocation in 
proportion to the weights of the VMs; for example, VM1 
got twice as many IOPS as VM3 did. 

We next used the same workloads to demonstrate how 
an administrator may determine the reservation to use. If 
the maximum latency desired and the maximum concur- 
rency of the application is known, then the reservation 
can be simply estimated using Little’s law as the ratio of 
the concurrency to the desired latency. In our case, if it is 
desired that the latency not exceed 65ms, the reservation 
can be computed as 32/0.065 = 492, since the number 
of concurrent IOs from each application is 32. First, we
ran the four VMs together with a reservation $r_i$ = 1 each,
and weights in the ratio 1:1:2:2.

Figure 8: (a) Without mClock, VM2 missed its minimum requirement when WinVM started (b) With mClock, both
OLTP workloads got their reserved IOPS despite WinVM workload (c) Application-level metrics: ops/s, avg Latency

VM     $r_i$ = 1, [IOPS, ms]    $r_i$ = 512, [IOPS, ms]
VM1    330, 96ms                490, 68ms
VM2    390, 82ms                496, 64ms
VM3    660, 48ms                514, 64ms
VM4    665, 48ms                530, 65ms

Table 4: mClock provided low latencies to VM1 and
VM2 and throughputs close to the reservation when the
reservations were changed from $r_i$ = 1 to 512 IOPS.

The throughput (IOPS) and latency received by each 
in this simultaneous run are shown in Table 4. Note that 
workloads received IOPS in proportion to their weights, 
but the latencies of VM1 and VM2 were much higher 
than desired. We then set the reservation ($r_i$) for each
VM to be 512 IOPS; the results are shown in the last
column of Table 4. Note that the first two VMs received higher
IOPS of around 500 instead of 330 and 390, which is
close to their reservation targets. The latency is also
close to the expected value of 65ms. The other VMs saw 
a corresponding decline in their throughput. The reser- 
vation targets of VM1 and VM2 were not entirely met 
because the overall throughput was slightly smaller than 
the sum of reservations. This experiment demonstrates 
that mClock is able to provide a strong control to stor- 
age admins to meet their IOPS and latency targets for a 
given VM. 


4.1.4 Bursty VM Workloads 


Next, we experimented with the use of idle credits given 
to a workload for handling bursts. Recall that idle credits 
allow a workload to receive service in a burst only if the 
workload has been idle in the past and the reservations 
for all VMs have been met. This ensures that if an ap- 
plication is idle for a while, it gets preference when next 
there is spare capacity in the system. In this experiment,
we used two workloads generated with Iometer on
Windows Server VMs. The first workload was bursty,
generating 128 IOs every 400ms, all 4KB reads, 80% random.
The second was steady, producing 16 KB reads, 20% of
them random and the rest sequential, with 32 outstanding
IOs. Both VMs had equal shares, no reservation, and
no limit imposed on the throughput. We used idle-credit
($\sigma$) values of 1 and 64 for our experiment.

VM     $\sigma$ = 1, [IOPS, ms]    $\sigma$ = 64, [IOPS, ms]
VM1    312, 49ms                   316, 30.8ms
VM2    2420, 13.2ms                2460, 12.9ms

Table 5: The bursty workload (VM1) saw an improved
latency when given a higher idle credit of 64. The overall
throughput remained unaffected.

Table 5 shows the IOPS and average latency obtained 
by the bursty VM for the two settings of the idle credit. 
The number of IOPS were almost equal in either case 
because idle credits do not impact the overall bandwidth 
allocation over time, and VM1 had a bounded request
rate. VM2 also saw almost the same IOPS for the two 
settings of idle credits. However, we notice that the la- 
tency seen by the bursty VM1 decreased as we increased 
the idle credits. VM2 also saw a similar or a slightly 
smaller latency, perhaps due to the increase in efficiency 
of doing several IOs at a time from a single VM, which 
are likely to be spatially closer on the storage device. 

In the extreme, however, a very high setting of idle 
credits can lead to high latencies for non-bursty work- 
loads by distorting the effect of the weights (although 
not the reservations or limits), and so we limit the set- 
ting to a maximum of 256 IOs in our implementation. 
This result indicates that using idle credits is an effec- 
tive mechanism to help lower the latency of bursts. 


4.1.5 Filebench Workloads 


To test mClock with more realistic workloads, we
experimented with two Linux RHEL VMs running OLTP
workload using Filebench [26]. Each VM was configured
with 1 VCPU, 512 MB of RAM, 10GB database
disk, and 1 GB log virtual disk. To introduce throughput
fluctuation another Windows 2003 VM running Iometer 
was used. The Iometer workload produced 32 concur- 
rent, 16KB random reads. We assigned the weights in 
the ratio 2:1:1 to the two OLTP workloads and the Iome- 
ter workload, respectively, and gave a reservation of 
500 IOPS to each OLTP workload. We initially started 
the two OLTP workloads together and then the Iometer 
workload at t = 115s. 

Figures 8(a) and (b) show the IOPS received by the 
three workloads as measured inside the hypervisor, with 
and without mClock. Without mClock, as soon as the 
Iometer workload started, OLTP2 started missing its 
reservation and received around 250 IOPS. When run 
with mClock, both the OLTP workloads were able to 
achieve their reservations of 500 IOPS. This shows that 
mClock can protect critical workloads from a sudden 
change in the available throughput. The application- 
level metrics — the number of operations/sec and the 
transaction latency reported by Filebench — are sum- 
marized in Figure 8(c). Note that mClock was able to 
provide higher operations/sec and lower latency per op- 
eration in OLTP VMs, even with an increase in the over- 
all IO contention. 


4.2 dmClock Evaluation 


In this section, we present results of a dmClock
implementation in a distributed storage system. The system
consisted of multiple storage servers (nodes), three in
our experiment. Each node was implemented using a
virtual machine running RHEL Linux with a 10GB OS
disk and a 10GB experimental disk, from which the data
was served. Each experimental disk was placed on a
different LUN backed by a RAID-5 group with six disks.
Thus, each experimental disk could do roughly 1500
IOPS for a random workload. A single storage device,
shared by all clients, was then constructed by striping
across all the storage nodes. This configuration represents
a clustered-storage system where there are multiple
storage nodes, each with dedicated LUNs used for
servicing IOs.

We implemented dmClock as a user-space module in 
each server node. The module receives IO requests 
containing IO size, offset, type (read/write), the δ and
ρ parameters, and data in the case of write requests.
The module can keep up to 16 outstanding IOs (using 
16 threads) to execute the requests, and the requests 
are scheduled on these threads using the dmClock algo- 
rithm. The clients were run on a separate physical ma- 
chine. Each client generated an IO workload for one or 
more storage nodes and also acted as a gateway, piggybacking
the δ and ρ values onto each request sent to
the storage nodes. Each client workload consisted of 

Figure 9: IOPS obtained by the three clients for two different
cases. (a) All clients accessed the servers uniformly,
with no reservations. (b) Clients had reservations
of 800, 1000, and 100 IOPS, respectively.

Figure 10: IOPS obtained by the two clients. When c2
was started, c1 still met its reservation target.


8KB random reads with 64 concurrent IOs, uniformly 
distributed over the nodes it used. We used our own 
workload generator here because of the need to add appropriate
δ and ρ values to each request.

In the first experiment, we used three clients, {c1, c2, c3},
each accessing all three storage nodes. The weights were 
set in the ratio 1:4:6, with no upper limit on the IOPS. 
We experimented with two different cases: (a) no reservation
per client, and (b) reservations of 800, 1000 and 100 IOPS
for clients {c1, c2, c3} respectively. These values were
used to highlight a use case where the allocation based 
on reservations may be higher than the allocation based 
on weights or shares for some clients. The output for 
these two cases is shown in Figure 9 (a) and (b). Case 
(a) shows the overall IO throughput obtained by three 
clients without reservations. As expected, each client 
received total service in proportion to its weight. In case 
(b), dmClock was able to meet the reservation goal of 
800 IOPS for c1, which would have been missed with
a proportional share scheduler. The remaining throughput
was divided between clients c2 and c3 in the ratio
2:3 as they respectively received around 1750 and 2700
IOPS. Figure 9(b) also shows the IOs done during the
two phases of the algorithm. 

Next, we experimented with non-uniform accesses 
from clients. In this case we used two clients, c1 and c2, and
two storage servers. The reservations were set to 800
and 1000 IOPS and the weights were again in the ratio
1:4. c1 sent IOs to the first storage node (S1) only
and we started c2 after approximately 40 seconds. Figure 10
shows the IOPS obtained by the two clients over
time. Initially, c1 got the full capacity from server S1 and
when c2 was started, c1 was still able to get an allocation
close to its reservation of 800 IOPS. The remaining capacity
was allocated to c2, which received around 1400
IOPS. A distributed weight-proportional scheduler [45]
would have given approximately 440 IOPS to c1 and the
remainder to c2, which would have missed the minimum
requirement of c1. This shows that even when the access
pattern is non-uniform in a distributed environment,
dmClock is able to meet reservations and assign overall 
IOPS in the ratio of weights to the extent possible. 
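
The steady-state split described above can be illustrated with a small calculation. The sketch below is not the dmClock scheduler itself (which works by tagging individual requests); it only computes the allocation such a scheduler is expected to converge to, assuming the total throughput is known and the reservations fit within it. The function name `allocate` and its arguments are hypothetical.

```python
def allocate(total_iops, weights, reservations):
    """Split total_iops by weight, but never below a client's reservation.

    Illustrative only: assumes the reservations sum to no more than total_iops.
    """
    alloc = {}
    active = set(weights)
    remaining = float(total_iops)
    while active:
        weight_sum = sum(weights[c] for c in active)
        # Pure weight-proportional share of what is still unassigned.
        share = {c: remaining * weights[c] / weight_sum for c in active}
        # Clients whose proportional share would fall below their reservation.
        starved = {c for c in active if share[c] < reservations.get(c, 0)}
        if not starved:
            alloc.update(share)
            break
        # Pin starved clients to their reservation, then re-split the rest.
        for c in starved:
            alloc[c] = float(reservations[c])
            remaining -= reservations[c]
        active -= starved
    return alloc


weights = {"c1": 1, "c2": 4}
print(allocate(2200, weights, {}))                          # weights only
print(allocate(2200, weights, {"c1": 800, "c2": 1000}))     # with reservations
```

With the roughly 2200 IOPS observed in the second experiment, the weights-only call gives c1 about 440 IOPS, while the reservation-aware call gives c1 800 IOPS and c2 the remaining 1400, matching the figures quoted in the text.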


5 Conclusions 


In this paper, we presented a novel IO scheduling algo- 
rithm, mClock, that provides per-VM quality of service 
in the presence of variable overall throughput. The QoS
requirements for a VM are expressed as a minimum reservation,
a maximum limit, and a proportional share. A
key aspect of mClock is its ability to enforce such con- 
trols even with fluctuating overall capacity, as shown by 
our implementation in the VMware ESX server hypervi- 
sor. We also presented dmClock, a distributed version of 
our algorithm that can be used in clustered storage sys- 
tem architectures. We implemented dmClock in a dis- 
tributed storage environment and showed that it works 
as specified, maintaining global per-client reservations, 
limits, and proportional shares, even though the sched- 
ulers run locally on the storage nodes. 

The controls provided by mClock should allow 
stronger isolation between VMs. Although we have 
shown the effectiveness for hypervisor IO scheduling, 
we believe that the techniques are quite generic and can 
be applied to array-level scheduling and to other re- 
sources such as network bandwidth allocation as well. 
In our future work, we plan to explore further how to set 
these parameters to meet application-level SLAs. 
Abstract


We propose a new timekeeping architecture for virtu- 
alized systems, in the context of Xen. Built upon a feed- 
forward based RADclock synchronization algorithm, it 
ensures that the clocks in each OS sharing the hardware 
derive from a single central clock in a resource effective 
way, and that this clock is both accurate and robust. A 
key advantage is simple, seamless VM migration with 
consistent time. In contrast, the current Xen approach 
for timekeeping behaves very poorly under live migra- 
tion, posing a major problem for applications such as fi- 
nancial transactions, gaming, and network measurement, 
which are critically dependent on reliable timekeeping. 
We also provide a detailed examination of the HPET and 
Xen Clocksource counters. Results are validated using a 
hardware-supported testbed. 


1 Introduction 


Virtualization represents a major movement in the evo- 
lution of computer infrastructure. Its many benefits in- 
clude allowing the consolidation of server infrastructure 
onto fewer hardware platforms, resulting in easier man- 
agement and energy savings. Virtualization enables the 
seamless migration of running guest operating systems 
(guest OSs), which reduces reliance on dedicated hard- 
ware, and eases maintenance and failure recovery. 

Timekeeping is a core service on computing plat- 
forms, and accurate and reliable timekeeping is im- 
portant in many contexts including network measure- 
ment and high-speed trading in finance. Other applica- 
tions where accurate timing is essential to maintain at 
all times, and where virtualization can be expected to 
be used either now or in the future, include distributed 
databases, financial transactions, and gaming servers. 
The emerging market of outsourced cloud computing 
also requires accurate timing to manage and correctly bill 
customers using virtualized systems. 


Software clocks are based on local hardware (oscilla- 
tors), corrected using synchronization algorithms com- 
municating with reference clocks. For cost and conve- 
nience reasons, reference clocks are queried over a net- 
work. 

Since a notion of universally shared absolute time is 
tied to physics, timekeeping poses particular problems 
for virtualization, as a tight ‘real’ connection must be 
maintained across the OSs sharing the hardware. Both 
timekeeping and timestamping rely heavily on hardware 
counters. Virtualization adds an extra layer between the 
hardware and the OSs, which creates additional resource 
contention, and increased latencies, that impact perfor- 
mance. 

In this paper we propose a new timekeeping architec- 
ture for para-virtualized systems, in the context of Xen 
[1]. Using a hardware-supported testbed, we show how 
the current approach using the Network Time Protocol 
(NTP) system is inadequate, in particular for VM migra- 
tion. We explain how the feed-forward based synchro- 
nization adopted by the RADclock [14] allows a depen- 
dent clock paradigm to be used and ensures that all OSs 
sharing the hardware share the same (accurate and ro- 
bust) clock in a resource effective way. This results in 
robust and seamless live migration because each phys- 
ical host machine has its own unique clock, with hard- 
ware specific state, which never migrates. Only a state- 
less clock-reading function migrates. We also provide a 
detailed examination and comparison of the HPET and 
Xen Clocksource counters. 

Neither the idea of a dependent clock, nor the RAD- 
clock algorithm, are new. The key contribution here is to 
show how the feed-forward nature, and stateless clock 
read function, employed by the RADclock, are ideally 
suited to make the dependent clock approach actually 
work. In what is the first evaluation of RADclock in a vir- 
tualized context, we show in detail that the resulting so- 
lution is orders of magnitude better than the current state 
of the art in terms of both average and peak error following
disruptive events, in particular live migration. We
have integrated our work into the RADclock [14] pack- 
ages for Linux, which now support the architecture for 
Xen described here. 

After providing necessary background in Section 2, 
we motivate our work in depth by demonstrating the in- 
adequacies of the status quo in Section 3. Since hardware 
counters are key to timekeeping in general and to our so- 
lution in particular, Section 4 provides a detailed exam- 
ination of the behavior of counters of key importance to 
Xen. Section 5 describes and evaluates our proposed tim- 
ing architecture on a single physical host, and Section 6 
deals with migration. We conclude in Section 7. 


2 Background 


We provide background on Xen, hardware counters, 
timekeeping, the RADclock and NTP clocks, and com- 
parison methodology. 

To the best of our knowledge there is no directly rel- 
evant peer-reviewed published work on timekeeping in 
virtualized systems. A valuable resource however is [21]. 


2.1 Para-Virtualization and Xen 


All virtualization techniques rely on a hypervisor, which, 
in the case of Xen [1, 2, 9], is a minimal kernel with 
exclusive access to hardware devices. The hypervisor 
provides a layer of abstraction from physical hardware, 
and manages physical resources on behalf of the guests, 
ensuring isolation between them. We work within the 
para-virtualization paradigm, whereby all guest OS’s are 
modified to have awareness of, and access to, the native 
hardware via hypercalls to the hypervisor, which are sim- 
ilar to a system call. It is more challenging to support 
accurate timing under the alternative fully hardware vir- 
tualized paradigm [2], and we do not consider this here. 

Although we work in the context of para-virtualized 
Xen, the architecture we propose has broader applicabil- 
ity. We focus on Linux OS’s as this is the most active 
platform for Xen currently. In Xen, the guest OSs be- 
long to two distinct categories: Dom0 and DomU. The 
former is a privileged system which has access to most 
hardware devices and provides virtual block and network 
devices for the other, DomU, guests. 


2.2 Hardware Counters 


The heart of any software clock is local oscillator hard- 
ware, accessed via dedicated counters. Counters com- 
monly available today include the Time Stamp Counter 
(TSC) [6] which counts CPU cycles¹, the Advanced Configuration
and Power Interface (ACPI) [10], and the
High Precision Event Timer (HPET) [5].

¹ TSC is x86 terminology; other architectures use other names.

The TSC enjoys high resolution and also very fast ac- 
cess. Care is however needed in architectures where a 
unique TSC of constant nominal frequency may not ex- 
ist. This can occur because of multiple processors with 
unsynchronized TSC’s, and/or power management ef- 
fects resulting in stepped frequency changes and execu- 
tion interruption. Such problems were endemic in archi- 
tectures such as Intel Pentium III and IV, but have been 
resolved in recent architectures such as Intel Nehalem 
and AMD Barcelona. 

In contrast to CPU counters like the TSC, HPET and 
ACPI are system-wide counters which are unaffected by 
processor speed issues. They are always on and run at 
constant nominal rate, unless the entire system is sus- 
pended, which we ignore here. HPET is accessed via 
a data bus and so has much slower access time than the 
TSC, as well as lower resolution as its nominal frequency 
is about 14.3MHz. ACPI has even lower resolution with 
a frequency of only 3.57MHz. It has even slower access, 
since it is also read via a bus but unlike HPET is not 
memory mapped. 

Beyond the counter ticking itself, power management 
affects all counters through its impact on counter access 
latency, which naturally requires CPU instructions to be 
executed. Recent processors can, without stopping exe- 
cution, move between different P-States where the oper- 
ating frequency and/or voltage are varied to reduce en- 
ergy consumption. Another strategy is to stop processor 
execution. Different such idle states, or C-States C0, C1,
C2... are defined, where C0 is normal execution, and the
deeper the state, the greater the latency penalty to wake 
from it [12]. The impact on latency of these strategies is 
discussed in more detail later. 


2.3 Xen Clocksource 


The Xen Clocksource is a hardware/software hybrid 
counter presented to guest OSs by the hypervisor. It 
aims to combine the reliability of a given platform timer 
(HPET here) with the low access latency of the TSC. 
It is based on using the TSC to interpolate between 
HPET readings made on ‘ticks’ of the periodic interrupt
scheduling cycle of the OS (whose period is typically
1ms), and is scaled to a frequency of approximately
1GHz. It is a 64-bit cumulative counter, and is effectively 
initialized to zero for each guest when they boot (this is 
implemented by a ‘system time’ variable they keep) and 
monotonically increases. 

The Xen Clocksource interpolation is a relatively 
complex mechanism that accounts for lost TSC ticks 
(it actively overwrites the TSC register) and frequency 
changes of the TSC due to power management (it maintains
a TSC scaling factor which can be used by guests to
scale their TSC readings). Such compensation is needed 
on some older processors as described in Section 2.2. 
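
The interpolation idea can be sketched in a few lines. This is a simplified illustration and not Xen's actual implementation: the names `TickSnapshot` and `read_clocksource_ns` are hypothetical, and the compensation for lost ticks mentioned above is omitted. It assumes that, at each tick, the platform-timer-derived time and the TSC value are recorded together, and that a reading extrapolates from that snapshot using the TSC.

```python
from dataclasses import dataclass


@dataclass
class TickSnapshot:
    system_time_ns: int  # platform-timer-derived time recorded at the last tick
    tsc_at_tick: int     # TSC value recorded at the same instant
    tsc_hz: float        # current estimate of the TSC frequency (scaling factor)


def read_clocksource_ns(snap: TickSnapshot, tsc_now: int) -> int:
    """Interpolate forward from the last tick using the fast TSC."""
    elapsed_cycles = tsc_now - snap.tsc_at_tick
    # Scale cycles to nanoseconds, i.e. to a nominal 1GHz counter.
    return snap.system_time_ns + int(elapsed_cycles * 1e9 / snap.tsc_hz)


snap = TickSnapshot(system_time_ns=5_000_000, tsc_at_tick=1_000_000, tsc_hz=2.0e9)
print(read_clocksource_ns(snap, tsc_now=1_002_000))
```

The design trades a slow platform-timer read on every access for one read per tick plus a cheap TSC read, at the cost of depending on the accuracy of the scaling factor between ticks.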


2.4 Clock Fundamentals 


A distinction must be drawn between the software clock 
itself, and timestamping. A clock may be perfect, yet 
timestamps made with it be very inaccurate due to large 
and/or variable access latency. Such timestamping er- 
rors will vary widely depending on context and may er- 
roneously reflect on the clock itself. 


By a raw timestamp we mean a reading of the underlying
counter. For a clock C which reads C(t) at true time
t, the final timestamp will be a time in seconds based not 
only on the raw timestamp, but also the clock parameters 
set by the synchronization algorithm. 


A local (scaled) counter is not a suitable clock and 
needs to be synchronized because all counters drift if left 
to themselves: their rate, although very close to constant 
(typically measured to 1 part in 10° or less), varies. Drift 
is primarily influenced by temperature. 


Remote clock synchronization over a network is based 
on a (typically bidirectional) exchange of timing mes- 
sages from an OS to a time server and back, giving rise to 
four timestamps: two made by the OS as the timing mes- 
sage (here an NTP packet) leaves then returns, and two 
made remotely by the time server. Typically, exchanges 
are made periodically: once every poll-period. 


There are two key problems faced by remote synchro- 
nization. The first is to filter out the variability in the 
delays to and from the server, which effectively corrupt 
timestamps. This is the job of the clock synchronization 
algorithm, and it is judged on its ability to do this well 
(small error and small error variability) and consistently 
in real environments (robustness). 


The second problem is that of a fundamental ambiguity
between clock error and the degree of path asymmetry.
Let $\Delta = d^{\uparrow} - d^{\downarrow}$ denote the true path asymmetry,
where $d^{\uparrow}$ and $d^{\downarrow}$ are the true minimum one-way delays
to and from the server, respectively; and let $r = d^{\uparrow} + d^{\downarrow}$
be the minimal Round Trip Time (RTT). In the absence
of any external side-information on $\Delta$, we must guess a
value, and $\Delta = 0$ is typically chosen, corresponding to
a symmetric path. This allows the clock to be synchronized,
but only up to an unknown additive error lying
somewhere in the range $[-r, r]$. This ambiguity cannot
be circumvented, even in principle, by any algorithm.
We explore this further under Experimental Methodology
below.
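
The effect of path asymmetry on a four-timestamp exchange can be checked numerically. The sketch below uses the standard symmetric-path offset estimate; the function names and the delay values are illustrative assumptions, not taken from the paper. Here `d_up` and `d_down` stand for the one-way delays to and from the server, and the host clock is assumed to be ahead of the server by `true_offset`.

```python
def rtt_and_offset(ta, tb, te, tf):
    """ta/tf: host clock at departure/return; tb/te: server clock at arrival/departure."""
    rtt = (tf - ta) - (te - tb)
    # Host-minus-server offset, estimated under the symmetric-path assumption.
    offset = ((ta - tb) + (tf - te)) / 2.0
    return rtt, offset


def simulate_exchange(true_offset, d_up, d_down, server_delay=0.0, t0=100.0):
    """Generate the four timestamps for one exchange starting at true time t0."""
    ta = t0 + true_offset            # host timestamp on departure
    tb = t0 + d_up                   # server timestamp on arrival
    te = tb + server_delay           # server timestamp on departure
    tf = te + d_down + true_offset   # host timestamp on return
    return ta, tb, te, tf


true_offset, d_up, d_down = 0.005, 0.003, 0.001
rtt, estimate = rtt_and_offset(*simulate_exchange(true_offset, d_up, d_down))
asymmetry = d_up - d_down
print(rtt, estimate, true_offset - asymmetry / 2)
```

In this model the estimate differs from the true offset by half the asymmetry, and since the asymmetry cannot exceed the round-trip time in magnitude, the error stays inside the range given above. Nothing in the four timestamps reveals the asymmetry itself, which is why no algorithm can remove this error without side-information.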


2.5 Synchronization Algorithms 


The ntpd daemon [11] is the standard clock synchro- 
nization algorithm used today. It is a feedback based 
design, in particular since system clock timestamps are 
used to timestamp the timing packets. The existing ker- 
nel system clock, which provides the interface for user 
and kernel timestamping and which is ntpd-oriented, is 
disciplined by ntpd. The final software clock is therefore 
quite a complex system as the system clock has its own 
dynamics, which interacts with that of ntpd via feedback. 
On Xen, ntpd relies on the Xen Clocksource as its under- 
lying counter. 

The RADclock [20] (Robust Absolute and Difference 
Clock) is a recently proposed alternative clock synchro- 
nization algorithm based on a feed-forward design. Here 
timing packets are timestamped using raw packet times- 
tamps. The clock error is then estimated based on these 
and the server timestamps, and subtracted out when the 
clock is read. This is a feed-forward approach, since er- 
rors are corrected based on post-processing outputs, and 
these are not themselves fed back into the next round of 
inputs. In other words, the raw timestamps are indepen- 
dent of clock state. The ‘system clock’ is now stateless, 
simply returning a function of parameters maintained by 
the algorithm. 

More concretely, the (absolute) RADclock is defined
as $C_a(t) = N(t) \cdot p + K - E(t)$, where $N(t)$ is the raw
timestamp made at true time $t$, $p$ is a stable estimate of
average counter period, $K$ is a constant which aligns the
origin to the required timescale (such as UTC), and $E(t)$
is the current estimate of the error of the ‘uncorrected
clock’ $N(t) \cdot p + K$ which is removed when the clock
is read. The parameters $p$, $K$ and $E$ are maintained by
the clock algorithm (see [20] for details of the algorithm
itself). The clock reading function simply reads (or is
passed) the raw timestamp $N$ for the event of interest,
fetches the clock parameters, and returns $C_a(t)$.
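
A stateless clock-reading function of this form is short enough to write out. The following is a minimal sketch, assuming the three parameters (counter period, origin constant, and error estimate) are published by the synchronization algorithm; the type and function names are hypothetical and do not reflect the RADclock package's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockParams:
    period: float  # estimated average counter period, in seconds per tick
    origin: float  # constant aligning the origin to the required timescale
    error: float   # current estimate of the uncorrected clock's error


def read_absolute_clock(raw_count: int, params: ClockParams) -> float:
    """Return raw_count * period + origin - error, with no internal state."""
    return raw_count * params.period + params.origin - params.error


params = ClockParams(period=1e-9, origin=1000.0, error=0.5)
print(read_absolute_clock(2_000_000_000, params))
```

Because the function depends only on the raw timestamp and the current parameters, the same raw timestamp can be converted again later with improved parameters, and the function itself carries no hardware-specific state.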

The RADclock can use any counter which satisfies ba- 
sic requirements, namely that it be cumulative, does not 
roll over, and has reasonable stability. In this paper we 
provide results using both the HPET and Xen Clock- 
source. 


2.6 Experimental Methodology 


We give a brief description of the main elements of our 
methodology for evaluating clock performance. More 
details can be found in [15]. 

The basic setup is shown in Figure 1. It incorporates 
our own Stratum-1 NTP server on the LAN as the ref- 
erence clock, synchronised via a GPS-corrected atomic 
clock. NTP timing packets flow between the server and 
the clocks in the host machine (two OS’s each with two 
clocks are shown in the figure) for synchronization purposes,
typically with a poll period of 16 seconds. For
evaluation purposes, a separate box sends and receives 
a flow of UDP packets (with period 2 seconds) to the 
host, acting as a set of timestamping ‘opportunities’ for 
the clocks under test. In this paper a separate stream was 
sent to each OS in the host, but for a given OS, all clocks 
timestamp the same UDP flow. 

Based on timestamps of the arrival and departure of 
the UDP packets, the testbed allows two kinds of com- 
parisons. 

External: a specialized packet stamping ‘DAG’ card [4] 
timestamps packets just before they enter the NIC of the 
host machine. These can be compared to the timestamps 
for the same packets taken by the clocks inside the host. 
The advantage is an independent assessment; the disad- 
vantage is that there is ‘system’ lying between the two 
timestamping events, which adds a ‘system noise’ to the 
error measurement. 

Internal: clocks inside the same OS timestamp the pack- 
ets back-to-back (thanks to our kernel modifications), so 
subtracting these allows the clocks to be compared. The 
advantage is the elimination of the system noise between 
the timestamps; the disadvantage is that differences be- 
tween the clocks cannot be attributed to any specific 
clock. 

The results appearing in this paper all use the external 
comparison, but internal comparisons were also used as 
a key tool in the process of investigation and validation. 

As far as possible experiments are run concurrently so 
that clocks to be compared experience close to identical 
conditions. For example, clocks in the same OS share 
the very same NTP packets to the time server (and hence 
in particular, share the same poll period). There are a 
number of subtle issues we have addressed regarding the 
equivalence between what the test UDP packets, and the 
NTP packets actually used by the algorithm, ‘see’, which 
depends on details of the relative timestamping locations 
in the kernel. This topic is discussed further below in 
relation to ‘host asymmetry’. 

Figure 1: Testbed and clock comparison methodology.

It is essential to note that the delays experienced by
timing packets have components both in the network, and 
in the host itself (namely the NIC + hardware + OS), 
each with their own minimum RTT and asymmetry val- 
ues. Whereas the DAG card timestamps enable the net- 
work side to be independently measured and corrected, 
the same is not true of the host side component. Even 
when using the same server then, a comparison of dif- 
ferent clocks, which have different asymmetry induced 
errors, is problematic. Although the spread of errors can 
be meaningfully compared, the median errors can only 
be compared up to some limit imposed by the (good but 
not perfect) methodology, which is of the order of 1 to
10 µs. Despite these limitations, we believe our methodology
to be the best available at this time.


3 Inadequacy of the Status-Quo 


The current timekeeping solution for Xen is built on top 
of the ntpd daemon. The single most important thing 
then regarding the performance of Xen timekeeping is to 
understand the behavior first of ntpd in general, and then 
in the virtualized environment. 

There is no doubt that ntpd can perform well under 
the right conditions. If a good quality nearby time server 
is available, and if ntpd is well configured, then its per- 
formance on modern kernels is typically in the tens of 
microseconds range and can rival that of the RADclock. 
An example in the Xen context is provided in Figure 2, 
where the server is a Stratum-1 on the same LAN, and 
both RADclock and ntpd, running on Dom0 in parallel,
are synchronizing to it using the same stream of
NTP packets. Here we use the host machine kultarr, a 
2.13GHz Intel Core 2 Duo. Xen selects a single CPU for 
use with Xen Clocksource, which is then used by ntpd. 
Power management is disabled in the BIOS. 

The errors show a similar spread, with an Inter-Quartile
Range (IQR) of around 10 µs for each. Note that
here path asymmetry effects have not been accounted for, 
so that as discussed above the median errors do not re- 
flect the exact median error for either clock. 
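
The distinction between spread and median can be made concrete with a short calculation. The sketch below summarizes a list of clock error samples by median and IQR using Python's standard `statistics` module; the sample values are invented for illustration and the function name `spread_summary` is hypothetical.

```python
import statistics


def spread_summary(errors_us):
    """Return (median, IQR) of a sequence of clock errors in microseconds."""
    q1, median, q3 = statistics.quantiles(errors_us, n=4)
    return median, q3 - q1


errors = [1, 2, 3, 4, 5, 6, 7]
shifted = [e + 10 for e in errors]  # same clock with an unknown constant bias
print(spread_summary(errors))
print(spread_summary(shifted))
```

A constant bias, such as one induced by uncorrected asymmetry, moves the median but leaves the IQR unchanged, which is why the spreads of two clocks can be compared even when their medians cannot.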

In this paper our focus is on the right architecture 
for timing in virtualized systems, in particular such that 
seamless VM migration becomes simple and reliable, 
and not any performance appraisal of ntpd per se. Ac- 
cordingly, unless stated otherwise we consistently adopt 
the configuration which maximizes ntpd performance — 
single nearby statically allocated Stratum-1 server, static 
and small polling period. 

The problem with ntpd is the sudden performance 
degradations which can occur when conditions deviate 
from ‘ideal’. We have detailed these robustness issues of 
ntpd in prior work, including [17, 18]. Simply put, when 
path delay variability exceeds some threshold, which is a 

Figure 2: RADclock and ntpd uncorrected performance on Dom0, measured using the external comparison with DAG.

Figure 3: Error in the DomU dependent clock (dependent
on the ntpd on Dom0 shown in Figure 2), measured using
the external comparison with DAG. This 2 hour zoom is
representative of an experiment 20 days long.


complex function of parameters, stability of the feedback 
control is lost, resulting in errors which can be large over 
small to very long periods. Recovery from such periods 
is also subject to long convergence times. 

Consider now the pitfalls of using ntpd for timekeep- 
ing in Xen, through the following three scenarios. 


Example 1 - Dependent ntpd clock In a dependent
clock paradigm, only Dom0 runs a full clock synchro- 
nization algorithm, in this case ntpd. Here we use a 
2.6.26 kernel, the last one supporting ntpd dependent 
timekeeping. 

In the solution detailed in [19], synchronizing times- 
tamps from ntpd are communicated to DomU guests via 
the periodic adjustment of a ‘boot time’ variable in the 
hypervisor. Timestamping in DomU is achieved by a 
modified system clock call which uses Xen Clocksource 
to extrapolate from the last time this variable was up- 
dated forward to the current time. The extrapolation as- 
sumes Xen Clocksource to be exactly 1GHz, resulting 
in a sawtooth shaped clock error which, in the example 
from our testbed given in Figure 3, is in the millisecond 
range. This is despite the fact that the Dom0 clock it 
derives from is the one depicted in Figure 2, which has 
excellent performance in the 10 µs range.

These large errors are ultimately a result of the way 
in which ntpd interacts with the system clock. With no 
ntpd running on DomU (the whole point of the depen- 
dent clock approach), the system clock has no way of in- 
telligently correcting the drift of the underlying counter, 


in this case Xen Clocksource. The fact that Xen Clock- 
source is in reality only very approximately 1GHz means 
that this drift is rapid, indeed appearing to first order as 
a simple skew, that is a constant error in frequency. This 
failure of the dependent clock approach using ntpd has 
led to the alternative solution used today, where each 
guest runs its own independent ntpd daemon. 
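
The sawtooth shape follows directly from extrapolating with a slightly wrong rate. The sketch below assumes the error is reset to zero at each update of the shared variable and then grows linearly with the frequency skew; the skew of 100 parts per million and the 10 second update period are illustrative values chosen to land in the millisecond range, not measurements from the paper.

```python
def dependent_clock_error(seconds_since_update, skew):
    """Error accumulated when the counter rate differs from nominal by `skew`."""
    return skew * seconds_since_update


def sawtooth(update_period, skew, sample_times):
    """Clock error at each sample time, assuming a reset at every update."""
    return [dependent_clock_error(t % update_period, skew) for t in sample_times]


# Hypothetical values: 100 PPM skew, variable refreshed every 10 seconds.
for t, err in zip([0.0, 5.0, 9.0, 10.0, 15.0],
                  sawtooth(10.0, 1e-4, [0.0, 5.0, 9.0, 10.0, 15.0])):
    print(f"t={t:5.1f}s error={err * 1e3:.2f} ms")
```

Under these assumptions the error approaches 1 ms just before each update, however accurate the Dom0 clock supplying the updates may be.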


Example 2 — Independent ntpd clock In an independent
clock paradigm, which is used currently in Xen
timekeeping, each guest OS (both Dom0 and DomU) in- 
dependently runs its own synchronization algorithm, in 
this case ntpd, which connects to its own server using its 
own flow of NTP timing packets. Clearly this solution 
is not ideal in terms of the frugal use of server, network, 
NIC and host resources. In terms of performance, the un- 
derlying issue is that the additional latencies suffered by 
guests in the virtual context make it more likely ntpd will 
be pushed into instability. Important examples of such 
latencies are the descheduling and time-multiplexing of 
guests across physical cores. 


An example is given in Figure 4, where, despite syn- 
chronizing to the same high quality server on the LAN as 
before, stability is lost and errors reach the multiple mil- 
lisecond range. This was brought about simply by adding 
a moderate amount of system load (some light churn of 
DomU guests and some moderate CPU activity on other 
guests), and allowing NTP to select its own polling pe- 
riod (in fact the default configuration), rather than fixing 
it to a constant value. 

Figure 4: Error in the ntpd independent clock on DomU
synchronizing to a Stratum-1 server on the LAN, with
polling period set by ntpd. Additional guests are created
and destroyed over time.

Example 3 — Migrating independent ntpd clock This
example considers the impact of migration on the syn- 
chronization of the ntpd independent clock (the current 
solution) of a migrating guest. Since migration is treated 
in detail in Section 6, we restrict ourselves here to point- 
ing to Figure 11, where extreme disruption — of the order 
of seconds — is seen following migration events. This is 
not a function of this particular example but is a generic 
result of the design of ntpd in conjunction with the in- 
dependent clock paradigm. A dependent ntpd clock so- 
lution would not exhibit such behavior under migration, 
however it suffers from other problems as detailed above. 


In summary, there are compelling design and robust- 
ness reasons for why ntpd is ill suited to timekeeping in 
virtual environments. The RADclock solution does not 
suffer from any of the drawbacks detailed above. It is 
highly robust to disruptions in general as described for 
example in [20, 15, 16], is naturally suited to a depen- 
dent clock paradigm as detailed in Section 5, as well as 
to migration (Section 6). 


4 Performance of Xen Clocksource 


The Xen Clocksource hybrid counter is a central compo- 
nent of the current timing solution under Xen. In this sec- 
tion we examine its access latency under different condi- 
tions. We also compare it to that of HPET, both because 
HPET is a core component of Xen Clocksource, so this 
enables us to better understand how the latter is perform- 
ing, and because HPET is a good choice of counter, be- 
ing widely available and uninfluenced by power manage- 
ment. This section also provides the detailed background 
necessary for subsequent discussions on network noise. 

Access latency, which impacts directly on timekeep- 
ing, depends on the access mechanism. Since the tim- 
ing architecture we propose in Section 5 is based on a 
feed-forward paradigm, to be of relevance our latency 
measurements must be of access mechanisms that are 
adequate to support feed-forward based synchronization. 
The fundamental requirement is that a counter be de- 
fined, which is cumulative, wide enough to not wrap be- 
tween reboots (we use 64-bit counters which take 585 
years to roll over on a 1GHz processor), and accessible 
from both kernel and user context. The existing ntpd- 
oriented software clock mechanisms do not satisfy these 
conditions. We describe below the alternatives we im- 
plement. 
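
The rollover figure quoted above is easy to verify. The following sketch computes the time for a counter of a given width to wrap at a given frequency; the 32-bit case is added only for contrast and is not discussed in the text.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_rollover(bits, frequency_hz):
    """Years until a `bits`-wide counter ticking at frequency_hz wraps around."""
    return (2 ** bits) / frequency_hz / SECONDS_PER_YEAR


print(f"64-bit at 1GHz: {years_to_rollover(64, 1e9):.1f} years")
print(f"32-bit at 1GHz: {years_to_rollover(32, 1e9) * SECONDS_PER_YEAR:.2f} seconds")
```

A 32-bit counter at the same rate would wrap within a few seconds, which illustrates why the width requirement matters for a cumulative counter.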


4.1 Baseline Latencies 


In this section we use the host machine kultarr, a 
2.13GHz Intel Core 2 Duo, and measure access latencies 
by counting the number of elapsed CPU cycles. For this 

Figure 5: Distribution of counter latencies. Top: TSC
and HPET in an unvirtualized system using feed-forward
compatible access; Middle: Xen Clocksource and HPET
from Dom0; Bottom: from DomU.


purpose we use the rdtsc() function, a wrapper for the 
x86 RDTSC instruction to read the relevant register(s) 
containing the TSC value and to return it as a 64-bit in- 
teger. This provides direct access to the TSC with very 
low overhead from both user and kernel space. To en- 
sure a unique and reliable TSC, from the BIOS we dis- 
able power management (both P-states and C-states), and 
also disable the second core to avoid any potential failure 
of TSC synchronization across the cores. 


We begin by providing a benchmark result for HPET 
on a non-virtualized system. The top right plot in Fig- 
ure 5 gives its latency histogram measured from within 
kernel context, using the access mechanism described in 
[3]. This mechanism augments the Linux clocksource 
code (which supports a choice among available hardware 
counters), to expose a 64-bit cumulative version of the 
selected counter. For comparison, in the top left plot we 
give the latency of the TSC accessed in the same way — 
it is much smaller as expected, but both counters have 
low variability. Note (see [3]) that the latency of TSC 
accessed directly via rdtsc() is only around 80 cycles, so 
that the feed-forward friendly access mechanism entails 
an overhead of around 240 cycles on this system. 

Now consider latency in a Xen system. To provide
the feed-forward compatible access required for each of 
HPET and Xen Clocksource, we first modified the Xen 
Hypervisor (4.0) and the Linux kernel 2.6.31.13 (Xen 
pvops branch) to expose the HPET to Dom0 and DomU. 
We added a new hypercall entry that retrieves the current 
raw platform timer value, HPET in this case. Like Xen 
Clocksource, this is a 64-bit counter which satisfies the 
cumulative requirement. We then added a system call to 
enable each counter to be accessed from user context. Fi- 
nally, for our purposes here we added an additional sys- 
tem call that measures the latency of either HPET or Xen 
Clocksource from within kernel context using rdtsc(). 

The middle row in Figure 5 shows the latency of Xen 
Clocksource and HPET from Dom0’s kernel point of 
view. The Xen Clocksource interpolation mechanism 
adds an extra 200 CPU cycles compared to accessing 
the TSC alone in the manner seen above, for a total of 
250 ns at this CPU frequency. The HPET latency suf- 
fers from the penalty created by the hypercall needed to 
access it, lifting its median value by 800 cycles for a to- 
tal latency of 740 ns. More importantly, we see that the 
hypercall also adds more variability, with an IQR that in- 
creases from 40 to 72 CPU cycles, and the appearance 
of a multi-modal distribution which we speculate arises 
from some form of contention among hypercalls. The 
Xen Clocksource in comparison has an IQR only slightly 
larger than that of the TSC counter. 
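
The latencies in this section are quoted partly in CPU cycles and partly in nanoseconds. As a reading aid, the following sketch converts between the two at kultarr's nominal 2.13GHz frequency. The function names are hypothetical, and the nominal frequency is assumed to equal the actual TSC rate.

```python
KULTARR_HZ = 2.13e9  # nominal CPU frequency of kultarr, taken from the text


def cycles_to_ns(cycles: float, cpu_hz: float = KULTARR_HZ) -> float:
    """Convert a cycle count to nanoseconds at the given CPU frequency."""
    return cycles / cpu_hz * 1e9


def ns_to_cycles(ns: float, cpu_hz: float = KULTARR_HZ) -> float:
    """Convert nanoseconds to a cycle count at the given CPU frequency."""
    return ns * 1e-9 * cpu_hz


if __name__ == "__main__":
    for label, ns in (("Xen Clocksource", 250), ("HPET via hypercall", 740)):
        print(f"{label}: {ns} ns ~ {ns_to_cycles(ns):.0f} cycles")
    print(f"direct rdtsc(): 80 cycles ~ {cycles_to_ns(80):.1f} ns")
```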

The bottom row in Figure 5 shows the latency of Xen 
Clocksource and HPET from DomU’s kernel point of 
view. The Xen Clocksource shows the same performance 
as in the Dom0 case. HPET is affected more, with an in- 
crease in the number of modes and the mass within them, 
resulting in a considerably increased IQR. 

In conclusion, Xen Clocksource performs well despite 
the overhead of its software interpolation scheme. In par- 
ticular, although its latency is almost double that of a 
simple TSC access (and 7 times a native TSC access), it 
does not add a significant latency variability even when 
accessed from DomU. On the other hand, the
simple feed-forward compatible way of accessing HPET 
used here is only four times slower than the much more 
complicated Xen Clocksource and is still under 1 us. 
This performance could certainly be improved, for exam- 
ple by replacing the hypercall by a dedicated mechanism 
such as a read-only memory-mapped interface. 


4.2 Impact of Power Management 


Power management is one of the key mechanisms po- 
tentially affecting timekeeping. The Xen Clocksource is 
designed to compensate for its effects in some respects. 
Here we examine its ultimate success in terms of latency. 

In this section we use the host machine sarigue, a
3GHz Intel Core 2 Duo E8400. Since we do not use
rdtsc() for latency measurement in this section, we do
not disable the second core as we did before. Instead we
measure time differences in seconds, to sub-nanosecond
precision, using the RADclock difference clock [20] with
HPET as the underlying counter². P-states are disabled
in the BIOS, but C-states are enabled.

Ideally one would like to directly measure Xen Clock- 
source’s interpolation mechanism and so evaluate it in 
detail. However, the Xen Clocksource recalibrates the 
HPET interpolation on every change in P-State (fre- 
quency changes) or C-State (execution interruption), as 
well as once per second. Since oscillations for exam- 
ple between C-States occur hundreds of times per second 
[8], it is not possible to reliably timestamp these events, 
forcing us to look at coarser characterizations of perfor- 
mance. 

From the timestamping perspective, the main problem 
is the obvious additional latency due to the computer be- 
ing idle in a C-State when an event to timestamp occurs. 
For example, returning from C-State C3 to execution C0
takes about 20 us [8]. 

For the purpose of synchronization over the network,
the timestamping of outgoing and incoming
synchronization packets is of particular interest. A
useful measure of these is the Round-Trip-Time (RTT) of
a request-reply exchange. However, since this includes
network queuing and delays at the time server as well as
delays in the host, it is of limited use in isolating
the latter.

To observe the host latency we introduce a metric we 
call the RTThost, which is roughly speaking the compo- 
nent of the RTT that lies within the host. More precisely, 
for a given request-reply packet pair, the RTThost is the 
sum of the two one-way delays from the host to the DAG 
card, and from the DAG card to the host. It is not possible 
to reliably measure these one-way delays individually in 
our testbed. However, the RTThost can be reliably mea- 
sured as the difference of the RTT seen by the host and 
that seen by the DAG card. The RTThost is a measure 
of ‘system noise’ with a specific focus on packet times- 
tamping. The smaller RTThost, the less noisy the host, 
and the higher the quality of packet timestamps. 
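
The RTThost definition above can be sketched as follows. The sketch assumes four timestamps per request-reply exchange, two taken by the host and two by the DAG card; the record layout and names are hypothetical. Only differences within each clock are used, so the host and DAG timescales do not need to be aligned with each other.

```python
from dataclasses import dataclass


@dataclass
class Exchange:
    host_send: float  # host timestamp of the outgoing request (s)
    host_recv: float  # host timestamp of the incoming reply (s)
    dag_out: float    # DAG timestamp of the request on the wire (s)
    dag_in: float     # DAG timestamp of the reply on the wire (s)


def rtt_host(x: Exchange) -> float:
    """RTT seen by the host minus the RTT seen by the DAG card."""
    return (x.host_recv - x.host_send) - (x.dag_in - x.dag_out)


if __name__ == "__main__":
    # Made-up values: host RTT of 450 us, DAG RTT of 370 us.
    x = Exchange(host_send=100.000000, host_recv=100.000450,
                 dag_out=5.000030, dag_in=5.000400)
    print(f"RTThost = {rtt_host(x) * 1e6:.1f} us")
```

The result is the sum of the two one-way host-to-DAG and DAG-to-host delays, without either being measured individually.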

Figure 6 shows RTThost values measured over 80
hours on two DomU guests on the same host. The capture
starts with only the C-State C0 enabled, that is with
power management functions disabled. Over the period
of the capture, deeper C-States are progressively enabled
and we observe the impact on RTThost. At each stage
the CPU moves between the active state C0 and the idle
states enabled at the time. Table 1 gives a breakdown of
time spent in different states. It shows that typically
the CPU will rest in the deepest allowed C-state unless
there is a task to perform.


² This level of precision relates to the difference clock itself, when
measuring time differences of size of the order of 100 us as here. It
does not take into account the separate issue of timestamping errors,
such as the (much larger!) counter access latencies studied above.


Figure 6: System noise as a function of the deepest enabled C-State for Xen Clocksource (upper time series and left
box plots) and HPET (lower time series and right box plots). Time series plots have been sampled for clarity.

The left plot in Figure 6 is a compact representation of 
the distribution of RTThost values for Xen Clocksource 
and HPET, for each section of the corresponding time se- 
ries presented on the right of the figure. Here whiskers 
show the minimum and 99th percentile values, the lower 
and upper sides of the box give the 25th and 75th per- 
centiles, while the internal horizontal line marks the me- 
dian. 
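
For readers who wish to reproduce such summaries from their own traces, here is a minimal sketch of the five values drawn in each box plot. It uses the standard library's `statistics.quantiles` with the `inclusive` method; the interpolation rule used for the paper's plots is not stated, so that choice is an assumption.

```python
import statistics


def box_summary(samples):
    """Return (min, 25th pct, median, 75th pct, 99th pct) of the samples."""
    # With n=100, quantiles() returns the 99 cut points p1..p99,
    # so index k-1 holds the k-th percentile.
    pct = statistics.quantiles(samples, n=100, method="inclusive")
    return (min(samples), pct[24], pct[49], pct[74], pct[98])


if __name__ == "__main__":
    print(box_summary(list(range(101))))
```

Using the 99th percentile rather than the maximum for the upper whisker keeps rare outliers from dominating the plot.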

The main observation is that, for each counter,
RTThost generally increases with the number of C-States
enabled, although it is slightly higher for HPET. The
increase in median RTThost from C0 to C3 is about 20 us, a
value consistent with [8]. The minimum value is largely
unaffected however, consistent with the fact that if a
packet (which of course is sent when the host is in C0)
is also received when it is in C0, then it would see the
RTThost corresponding to C0, even if it went idle in
between.

We saw earlier that the access latencies of HPET and 
Xen Clocksource differ by less than 1 us, and so this can- 
not explain the differences in their RTThost median val- 
ues seen here for each given C-State. These are in fact 
due to the slightly different packet processing in the two 
DomU systems.


               C0        C1        C2        C3
C0 enabled |  100%   |    -    |    -    |    -
C1 enabled |  2.17%  | 97.83%  |    -    |    -
C2 enabled |  2.85%  |  0.00%  | 97.15%  |    -
C3 enabled |  2.45%  |  0.00%  |  1.84%  | 95.71%


Table 1: Residency time in different C-States. Here “Cn
enabled” denotes that all states from C0 up to Cn are
enabled.

We conclude that Xen Clocksource, and HPET using
our proof of concept access mechanism, are affected by
power management when it comes to details of
timestamping latency. These translate into timestamping
errors, which will impact both clock reading and
potentially clock synchronization itself. The final size of
such errors however is also crucially dependent on the
asymmetry value associated with RTThost, which is
unknown. Thus the RTThost measurements effectively
place a bound on the system noise affecting timestamping,
but do not determine it.


5 New Architecture for Virtualized Clocks 


In this section we examine the performance and detail 
the benefits of the RADclock algorithm in the Xen en- 
vironment, describe important packet timestamping is- 
sues which directly impact clock performance, and fi- 
nally propose a new feed-forward based clock architec- 
ture for para-virtualized systems. 

In Section 5.1 we use sarigue, and in Section 5.2 kul- 
tarr, with the same BIOS and power management set- 
tings described earlier. 


5.1 Independent RADclock Performance 


We begin with a look at the performance of the RAD- 
clock in a Xen environment. Figure 7 shows the final er- 
ror of two independent RADclocks, one using HPET and 
the other Xen Clocksource, running concurrently in two 
different DomU guests. Separate NTP packet streams 
are used to the same Stratum-1 server on the LAN with 
a poll period of 16 seconds. The clock error for each 
clock has been corrected for path asymmetry, in order 
to reveal the underlying performance of the algorithm as 
a delay variability filter (this is possible in our testbed, 
but impossible for the clocks in normal operation). The 
difference of median values between the two clocks is
extremely small, and below the detection level of our
methodology. We conclude that the clocks essentially
have identical median performance.


Figure 7: RADclock performance in state C0 using Xen Clocksource and HPET, running in parallel, each in a separate
DomU guest.

In terms of clock stability, as measured by the IQR of 
the clock errors, the two RADclock instances are again 
extremely similar, which reinforces our earlier observa- 
tions that the difference in stability of the Xen Clock- 
source and HPET is very small (below the level of detec- 
tion in our testbed), and that RADclock works well with 
any appropriate counter. The low frequency oscillation 
present in the time series here is due to the periodic cy- 
cle of the air conditioning system in the machine room, 
and affects both clocks in a similar manner consistent 
with previous results [3]. It is clearly responsible for the 
bulk of the RADclock error in this and other experiments 
shown in this paper. 

Power management is also an important factor that 
may impact performance. Figure 8 shows the distribu- 
tion of clock errors of the RADclock, again using HPET 
and the Xen Clocksource separately but concurrently as 
above, with different C-State levels enabled. In this case 
the median of each distribution has simply been shifted 
to zero to ease the stability (IQR) comparison. For each 
of the C-State levels shown, the stability of the RADclock 
is essentially unaffected by the choice of counter. 

As shown in Figure 6, power management creates ad- 
ditional delays of higher variability when timestamping 
timing packets exchanged with the reference clock. The 
near indifference of the IQR given in Figure 8 to C-State 
shows that the RADclock filtering is robust enough to see 
through this extra noise. 

Power management also has an impact on the asymmetry
error all synchronization algorithms must face. In
an excellent example of systematic observation bias, in
a bidirectional paradigm a packet sent by an OS would
not be delayed by the power management strategy,
because the OS chooses to enter an idle state only when it
has nothing to do. On the other hand, over the time
interval defined by the RTT of a time request, it is likely
the host will choose to stop its execution and enter an
idle state (perhaps a great many times) and the returning
packet may find the system in such a state.
Consequently, only the timestamping of received packets is
likely to be affected by power management, which
translates into a bias towards an extra path asymmetry, in the
sense of ‘most but not all packets’, in the receiving
direction. This bias is difficult to measure independently and
authoritatively. The measurement of the RTThost shown
in Figure 6 gives however a direct estimate of an upper
bound for it.


Figure 8: Compact centred distributions of RADclock
performance as a function of the deepest C-State enabled
(whiskers give 1st to 99th percentile).


5.2 Sharing the Network Card 


The quality of the timestamping of network packets is 
crucial to the accuracy the synchronization algorithm can 
achieve. The networking in Xen relies on a firewall and 
networking bridge managed by Dom0. In Figure 9 we 
observe the impact of system load on the performance of 
this mechanism. 

The top plot shows the RTThost time series, as seen 
from the Dom0 perspective, as we add more DomU 
guests to the host. Starting with Dom0 only, we add an 
additional guest every 12 hours. None of the OSs run 
any CPU or networking intensive tasks. The middle plot 
gives the box plots of the time series above, where the 
increase in median and IQR values is more clearly seen. 
For reference the ‘native’ RTThost of a non-virtualized 
system is also plotted. The jump from this distribution 
to the one labeled ‘Dom0’ represents the cost of the
networking bridge implemented in Xen.


The last plot in Figure 9 shows the distribution of
RTThost values from each guest’s perspective. All guests
have much worse performance than Dom0, but their
performance degrades by an amount similar to that of Dom0
as a function of the number of guests. For a given guest
load level, the performance of each guest clock seems
essentially the same, though with small systematic
differences which may point to scheduling policies.


The observations above call for the design of a 
timestamping system under a dependent clock paradigm 
where Dom0 has an even higher priority in terms of net- 
working, so that it can optimize its timestamping qual- 
ity and thereby minimize the error in the central Dom0 
clock, to the benefit of all clocks on the system. Fur- 
ther, DomU packet timestamping should be designed to 
minimize any differences between DomU guests, and re- 
duce as much as possible the difference in host asym- 
metry between Dom0 and DomU guests, to help make 
the timestamping performance across the whole system 
more uniform. 


Figure 9: kultarr: RTThost (a.k.a. system noise) as a
function of the number of active guests. Top: RTThost
timeseries seen by Dom0; Middle: corresponding
distribution summaries (with native non-Xen case added on
the left for comparison); Bottom: as seen by each DomU.
Whiskers show the minimum and 99th percentile.


5.3 A Feed-Forward Architecture


As described in Section 2.5, the feed-forward approach 
used by the RADclock has the advantage of cleanly 
separating timestamping (performed as a raw times- 
tamp in the kernel or user space as needed), which is 
stateless, and the clock synchronization algorithm itself, 
which operates asynchronously in user space. The algo- 
rithm updates clock parameters and makes them avail- 
able through the OS, where any authorized clock reading 
function (a kind of almost trivial stateless “system clock”)
can pick them up and use them either to compose an ab- 
solute timestamp, or robustly calculate a time difference 
[20]. 

The RADclock is then naturally suited for the depen- 
dent clock paradigm and can be implemented in Xen as a 
simple read/write stateless operation using the XenStore, 
a file system that can be used as an inter-OS communica- 
tion channel. After processing synchronization informa- 
tion received from its time server, the RADclock running 
on Dom0 writes its new clock parameters to the Xen- 
Store. On DomU, a process reads the updated clock pa- 
rameters upon request and serves them to any application 
that needs to timestamp events. The application times- 
tamps the event(s) of interest. These raw timestamps can 
then be easily converted either into a wallclock time or 
a time difference measured in seconds (this can even be 
done later off-line). 
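
The separation between raw timestamping and later conversion can be illustrated with a deliberately simplified sketch. It assumes the published clock parameters reduce to a counter period estimate and an offset, which is far less than the real RADclock maintains; an in-memory dictionary stands in for the XenStore, and every name below is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockParams:
    period: float  # estimated seconds per counter tick
    offset: float  # estimated wallclock time at counter value 0 (s)


def absolute_time(raw: int, p: ClockParams) -> float:
    """Compose an absolute timestamp from a raw counter reading."""
    return raw * p.period + p.offset


def time_difference(raw_start: int, raw_end: int, p: ClockParams) -> float:
    """Interval in seconds between two raw readings; the offset cancels."""
    return (raw_end - raw_start) * p.period


# Stand-in for the shared store: Dom0 publishes, DomU readers fetch.
store = {}


def dom0_publish(p: ClockParams) -> None:
    store["clock_params"] = p


def domu_read() -> ClockParams:
    return store["clock_params"]


if __name__ == "__main__":
    dom0_publish(ClockParams(period=1e-9, offset=1_000_000.0))
    params = domu_read()
    t0, t1 = 5_000_000_000, 5_000_250_000  # raw counter readings
    print(absolute_time(t0, params))
    print(time_difference(t0, t1, params))
```

In this sketch the reader holds no state of its own: the raw readings can be stored and converted at any later time with whichever parameters are then available.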

Unlike with ntpd and its coupled relationship to the 
(non-trivial) incumbent system clock code, no adjust- 
ment is passed to another dynamic mechanism, which 
ensures that only a single clock, clearly defined in a sin- 
gle module, provides universal time across Dom0 and all 
DomU guests. 


With the above architecture, there is only one way in
which a guest clock can fail to be strictly identical with
the central Dom0 clock. The read/write operation on the
XenStore is not instantaneous and it is possible that the 
update of clock parameters, which is slightly delayed af- 
ter the processing of a new synchronization input to the 
RADclock, will result in different parameters being used 
to timestamp some event. In other words, the time across 
OSs may appear different for a short time if a timestamp- 
ing function in a DomU converts a raw timestamp with 
outdated data. However, this is a minor issue since clock 
parameters change slowly, and using out of date values 
has the same impact as the synchronization input simply 
being lost, to which the clock is already robust. 

In Figure 10 we measured the time required to write to
the XenStore using the RADclock difference clock, which
has an accuracy well below 1 us [20]. We present results
obtained on 2 host machines with slightly different
hardware architectures, namely kultarr (2.13 GHz Intel Core
2 Duo) and tastiger (3.40 GHz Intel Pentium D), that
show respective median delays of 1.2 and 1.4ms.
Assuming a 16s poll period, this corresponds to 1 chance
out of 11,500 that the clocks would (potentially) disagree
if read at some random time.


Figure 10: Distribution of clock update latency through
the XenStore, tastiger (left, Pentium D, 3.4GHz) and
kultarr (right, Core 2 Duo, 2.13GHz).
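
The odds quoted above follow from a simple ratio. The sketch below assumes a read falls uniformly at random within a poll period, and that the clocks can disagree only while an update is in flight; the function name is ours.

```python
def disagreement_odds(update_delay_s: float, poll_period_s: float) -> float:
    """Return N such that the chance of a stale read is about 1 in N."""
    return poll_period_s / update_delay_s


if __name__ == "__main__":
    for delay_ms in (1.2, 1.4):
        n = disagreement_odds(delay_ms * 1e-3, 16.0)
        print(f"{delay_ms} ms update delay: about 1 chance in {n:,.0f}")
```

With the larger 1.4ms median delay the ratio is roughly 1 in 11,400, in line with the rounded figure in the text.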

The dependent RADclock is ideally suited for time 
keeping on Xen DomU. It is a simple, stateless, standard 
read/write operation that is robust as it avoids the danger- 
ous dynamics of feedback approaches, ensures that the 
clocks of all guests agree, and is robust to system load 
and power management effects. As a dependent clock 
solution, it saves both host and network resources and 
is inherently scalable. Thanks to a simple timestamping 
function it provides the same level of final timekeeping 
accuracy to all OSs. 


6 A Migration-Friendly Architecture 


Seamlessly migrating a running system from one phys- 
ical machine to another is a key innovation of virtu- 
alization [13, 7]. However this operation becomes far 
from seamless with respect to timing when using ntpd. 
As mentioned in Section 3, ntpd’s design requires each 
DomU to run its own instance of the ntpd daemon, which 
is fundamentally unsuited to migration, as we now ex- 
plain. 

The synchronization algorithm embodied in the ntpd 
daemon is stateful. In particular it maintains a time vary- 
ing estimate of the Xen Clocksource’s rate-of-drift and 
current clock error, which in turn is defined by the char- 
acteristics of the oscillator driving the platform counter. 
After migration, the characteristics seen by ntpd change 
dramatically since no two oscillators drift in the same 
way. Although the Xen Clocksource counters on each 
machine nominally share the same frequency (1GHz), in 
practice this is only true very approximately. The
temperature environment of the machine DomU migrates to
can be very different from the previous one, which can
have a large impact, but even worse, the platform timer
may be of a different nature, HPET originally and ACPI
after migration for example. Furthermore, ntpd will also
inevitably suffer from an inability to account for the time
during which DomU has been halted during the migration.
When DomU restarts, the reference wallclock time
and last Xen Clocksource value maintained by its system
clock will be quite inconsistent with the new ones,
leading to extreme oscillator rate estimates. In summary, the
sudden change in status of ntpd’s state information, from
valid to almost arbitrary, will, at best, deliver a huge error
immediately after migration, which we expect to decay
only slowly according to ntpd’s usual slow convergence.
At worst, the ‘shock’ of migration may push ntpd into an
unstable regime from which it may never recover.

In contrast, by decomposing the time information into 
raw timestamps and clock parameters, as described in 
Section 5, the RADclock allows the daemon running on 
DomU to be stateless within an efficient dependent clock 
strategy. The migration then becomes trivial from a
timekeeping point of view. Once migrated, DomU
timestamps events of interest with its chosen counter and
retrieves the RADclock clock parameters maintained by the
new Dom0 to convert them into absolute time. DomU
immediately benefits from the accuracy of the dedicated 
RADclock running on Dom0 — the convergence time is 
effectively zero. 

The plots in Figure 11 confirm the claims above and il- 
lustrate a number of important points. In this experiment, 
each of tastiger and kultarr runs an independent
RADclock in Dom0. The clock error for these is remarkably
similar, with an IQR below 10 us as seen in the top plot 
(measured using the DAG external comparison). Here 
for clarity the error time series for the two Dom0 clocks 
have been corrected for asymmetry error, thereby allow- 
ing their almost zero inherent median error, and almost 
identical behavior (the air-conditioning generated oscil- 
lations overlay almost perfectly), to be clearly seen. 

For the migration experiment, a single DomU OS is
started on tastiger, and two clocks launched on it: a
dependent RADclock, and an independent ntpd clock. A
few hours of warm up are then given (not shown) to
allow ntpd to fully converge. The experiment proper then
begins. At the 30 minute mark DomU is migrated to
kultarr; it migrates back to tastiger after 2 hours, then back
again after another 2, followed by further migrations with
a smaller period of 30 minutes.

The resulting errors of the two migrating DomU
clocks are shown in the top plot, and in a zoomed out
version in the middle plot, as measured using the
external comparison. Before the results, a methodological
point. The dependent RADclock running on DomU
is by construction identical to the RADclock running on
Dom0, and so the two time series (if asymmetry
corrected) would superimpose almost perfectly, with small
differences owing to the different errors in the
timestamping of the separate UDP packet streams. We choose
however, in the interests of fairness and simplicity of
comparison, not to apply the asymmetry correction in
this case, since it is not possible to apply an analogous
correction to the ntpd error time series. As a
substitute, we instead draw horizontal lines over the migrating
RADclock time series representing the correction which
would have been applied. No such lines can be drawn in
the ntpd case.


Figure 11: Clock errors under migration. Top: asymmetry corrected unmigrated RADclock Dom0 clocks, and
(uncorrected) migrated clocks on DomU; Middle: zoom out on top plot revealing the huge size of the migration ‘shock’ on
ntpd; Bottom: effect of migration load on Dom0 clocks on kultarr.


Now to the results. As expected, and from the very 
first migration, ntpd exhibits extremely large errors (from 
-1 to 27s!) for periods exceeding 15 minutes (see zoom 
in middle plot) and needs at least another hour to con- 
verge to a reasonable error level. The dependent RAD- 
clock on the other hand shows seamless performance 
with respect to the horizontal lines representing the
expected jumps due to asymmetry changes as just
described. These jumps are in any case small, of the order
of a few microseconds. Note that these corrections are
a function of both RTThost and asymmetry, both of which
differ between tastiger and kultarr.


Finally, we present a load test comparison. The bottom 
plot in Figure 11 compares in detail the performance of 
the independent RADclock running on Dom0 on kultarr, 
and an independent ntpd clock, also running on Dom0 
during the experiment (not shown previously). Whereas 
the RADclock is barely affected by the changes in net- 
work traffic and system load associated with the migra- 
tions of the DomU guest, ntpd shows significant devi- 
ation. In summary, not only is ntpd in an independent 
clock paradigm incompatible with clock migration, it is
also, regardless of paradigm, affected by migration
occurring around it.

One could also consider the performance of an inde- 
pendent RADclock paradigm under migration. However, 
we expect that the associated ‘migration shock’ would be 
severe as the RADclock is not designed to accommodate 
radical changes in the underlying counter. Since the de- 
pendent solution is clearly superior from this and many 
other points of view, we do not present results for the 
independent case under migration. 


7 Conclusion 


Virtualization of operating systems and accurate com- 
puter based timing are two areas set to increase in im- 
portance in the future. Using Xen para-virtualization as 
a concrete framework, we highlighted the weaknesses 
of the existing timing solution, which uses indepen- 
dent ntpd synchronization algorithms (coupled to state- 
ful software clock code) for each guest operating system. 
In particular, we showed that this solution is fundamen- 
tally unsuitable for the important problem of live VM 
migration, using both arguments founded on the design 
of ntpd, as well as detailed experiments in a hardware- 
validated testbed. 

We reviewed the architecture of the RADclock algo- 
rithm, in particular its underlying feed-forward basis, the 
clean separation between its timestamping and synchro- 
nization aspects, and its high robustness to network and 
system noise (latency variability). We argued that these 
features make it ideal as a dependent clock solution, par- 
ticularly since the clock is already set up to be read 
through combining a raw hardware counter timestamp 
with clock parameters sourced from a central algorithm 
which owns all the synchronization intelligence, via a 
commonly accessible data structure. We supported our 
claims by detailed experiments and side-by-side com- 
parisons with the status quo. For the same reasons, the 
RADclock approach enables seamless and simple migra- 
tion, which we also demonstrated in benchmarked ex- 
periments. The enabling of a dependent clock approach 
entails considerable scalability advantages and suggests 
further improvements through optimizing the timestamp- 
ing performance of the central clock in Dom0. 

As part of an examination of timestamping and 
counter suitability for timekeeping in general and the 
feed-forward paradigm in particular, we provided a de- 
tailed evaluation of the latency and accuracy of the Xen 
Clocksource counter, and compared it to HPET. We
concluded that it works well as intended, but note that
it is a complex solution created to solve a problem which
will soon disappear as reliable TSC counters again
become ubiquitous. The RADclock is suitable for use with
any counter satisfying basic properties, and we showed
its performance using HPET or Xen Clocksource was
indistinguishable.

The RADclock [14] packages for Linux now support a 
streamlined version of the architecture for Xen described 
here using Xen Clocksource as the hardware counter. 
With the special code allowing system instrumentation 
and HPET access removed, no modifications to the
hypervisor are ultimately required.


9 Availability


RADclock packages for Linux and FreeBSD,
software and papers, can be found at
http://www.cubinlab.ee.unimelb.edu.au/radclock/.


References 


[1] Xen.org History. http://www.xen.org/community/xenhistory.html.

[2] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S.,
HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND
WARFIELD, A. Xen and the Art of Virtualization. In SOSP ’03:
Proceedings of the nineteenth ACM symposium on Operating
systems principles (New York, NY, USA, 2003), ACM, pp. 164-177.

[3] BROOMHEAD, T., RIDOUX, J., AND VEITCH, D. Counter
Availability and Characteristics for Feed-forward Based
Synchronization. In Int. IEEE Symp. Precision Clock Synchronization for
Measurement, Control and Communication (ISPCS’09) (Brescia,
Italy, Oct. 12-16 2009), IEEE Piscataway, pp. 29-34.

[4] ENDACE. Endace Measurement Systems. DAG series PCI and
PCI-X cards. http://www.endace.com/networkMCards.htm.

[5] INTEL CORPORATION. IA-PC HPET (High Precision Event
Timers) Specification (revision 1.0a).
http://www.intel.com/hardwaredesign/hpetspec_1.pdf, Oct. 2004.

[6] KAMP, P. H. Timecounters: Efficient and precise timekeeping
in SMP kernels. In Proceedings of the BSDCon Europe 2002
(Amsterdam, The Netherlands, 15-17 November 2002).

[7] CLARK, C., FRASER, K., HAND, S., HANSEN,
J. G., JUL, E., LIMPACH, C., PRATT, I., AND WARFIELD, A.
Live Migration of Virtual Machines. In Proceedings of the 2nd
ACM/USENIX Symposium on Networked Systems Design and
Implementation (NSDI) (2005), pp. 273-286.

[8] KIDD, T. Intel Software Network Blogs.
http://software.intel.com/en-us/blogs/author/taylor-kidd/.

[9] MENON, A., SANTOS, J. R., TURNER, Y., JANAKIRAMAN,
G. J., AND ZWAENEPOEL, W. Diagnosing performance
overheads in the xen virtual machine environment. In VEE ’05:
Proceedings of the 1st ACM/USENIX international conference
on Virtual execution environments (New York, NY, USA, 2005),
ACM, pp. 13-23.

[10] MICROSOFT CORPORATION. Guidelines For Providing
Multimedia Timer Support. Tech. rep., Microsoft Corporation, Sep.
2002. http://www.microsoft.com/whdc/system/sysinternals/mm-timer.mspx.

[11] MILLS, D. L. Computer Network Time Synchronization: The
Network Time Protocol. CRC Press, Inc., Boca Raton, FL, USA,
2006.

[12] MOGUL, J., MILLS, D., BRITTENSON, J., STONE, J., AND
WINDL, U. Pulse-Per-Second API for UNIX-like Operating
Systems, Version 1.0. Tech. rep., IETF, 2000.

[13] NELSON, M., HONG LIM, B., AND HUTCHINS, G. Fast
transparent migration for virtual machines. In Proceedings of
the annual conference on USENIX Annual Technical Conference
(2005), USENIX Association.

[14] RIDOUX, J., AND VEITCH, D. RADclock Project webpage.

[15] RIDOUX, J., AND VEITCH, D. A Methodology for Clock
Benchmarking. In Tridentcom (Orlando, FL, USA, May 21-23 2007),
IEEE Comp. Soc.

[16] RIDOUX, J., AND VEITCH, D. The Cost of Variability. In Int.
IEEE Symp. Precision Clock Synchronization for Measurement,
Control and Communication (ISPCS’08) (Ann Arbor, Michigan,
USA, Sep. 24-26 2008), pp. 29-32.

[17] RIDOUX, J., AND VEITCH, D. Ten Microseconds Over LAN, for
Free (Extended). IEEE Trans. Instrumentation and Measurement
(TIM) 58, 6 (June 2009), 1841-1848.

[18] RIDOUX, J., AND VEITCH, D. Principles of Robust Timing Over
the Internet. ACM Queue, Communications of the ACM 53, 5
(May 2010), 54-61.

[19] THE XEN TEAM. Xen Documentation.
http://www.xen.org/files/xen_interface.pdf.

[20] VEITCH, D., RIDOUX, J., AND KORADA, S. B. Robust
Synchronization of Absolute and Difference Clocks over Networks.
IEEE/ACM Transactions on Networking 17, 2 (April 2009), 417-430.

[21] VMWARE. Timekeeping in VMware Virtual Machines. Tech.
rep., VMware, May 2010.
http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf.


9th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’10) 


USENIX Association 


